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ABSTRACT 

Given  a  sequence  of  observations,  has  a  change  occurred  in  the  underlying  probability 
distribution  with  respect  to  observation  order?  How  well  can  such  a  change  be  detected  if  the 
sequence  is  being  monitored  in  real-time?  The  problem  of  detecting  change,  and  detecting  it 
with  minimal  delay,  is  an  important  one  in  a  wide  variety  of  real-world  situations.  For  example, 
one  might  monitor  a  complicated  multivariate  system  (such  as  a  military  helicopter  or  a  human 
being)  with  the  goal  of  detecting  subtle  change  in  order  to  provide  advance  warning  of  system 
failure. 

Change-point  problems  may  be  classified  as  “online”  or  “offline.”  In  offline  problems, 
all  data  under  consideration  are  on-hand  at  the  time  of  analysis  and  the  goal  is  to  determine  if, 
and  perhaps  when,  a  change  occurred  in  the  observation  sequence.  This  leads  to  problems  if  a 
fatal  result  is  encountered  in  the  middle  of  the  process  from  which  data  is  being  collected 
because  it  is  too  late  for  detection  to  do  any  good.  In  online  problems,  data  are  collected  in  real¬ 
time  with  the  goal  of  identifying  a  change  as  soon  as  possible  after  it  occurs  and  thus  as  far  as 
possible  in  advance  of  death. 

This  project  explores  nonparametric  graph-theoretic  approaches  to  solving  online  change- 
point  problems.  The  foundation  for  our  methodology  is  the  Ensemble  Sum  of  Pair-Maxima 
(ESPM)  Test,  a  powerful  offline  test  developed  by  Ruth  and  Koyak  (201 1).  Our  work 
investigates  the  efficacy  of  the  ESPM  Test  in  a  variety  of  offline  settings,  and  ultimately  extends 
that  test  to  online  settings  through  a  novel  modification  of  recently  developed  multiple  testing 
procedures  designed  to  control  false  discovery  rate.  When  tested  against  simulated  and  pseudo 
real-world  data,  this  modified  procedure  maintains  the  desired  overall  test  level  while  achieving 
impressive  power  and  useful  advanced  warning  times  in  many  scenarios.  This  method  is  not 
limited  to  the  ESPM  test  and  holds  much  promise  for  adapting  other  powerful  offline  techniques 
to  online  scenarios. 


KEYWORDS:  Online  change  detection;  Ensemble  Sum  of  Pair-Maxima  test  (ESPM); 
Graph-theoretic  procedure;  Nonbipartite  matching;  Benjamini-Hochberg  Procedure;  Ealse 
Discovery  Rate 
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I.  INTRODUCTION 

Life  is  all  about  making  decisions.  Many  decisions  are  made  as  a  result  of  observing 
change.  A  bus  driver  stops  at  a  traffic  light  because  it  has  turned  from  green  to  red;  a  stock 
broker  decides  to  sell  a  majority  of  his  positions  due  to  a  change  in  the  markets;  a  child  grabs  a 
snack  because  he  has  become  hungry;  an  alarm  clock  goes  off  to  wake  someone  up  because  the 
correct  amount  of  time  has  passed;  a  linguistics  computer  program  awards  credit  to  a  student  for 
correctly  inflecting  his  voice.  People  (or  things)  that  are  better  able  to  identify  a  change  will  be 
better  equipped  to  make  decisions.  Humans  are  naturally  good  at  recognizing  changes  limited  in 
complexity  which  are  of  a  large-scale  and/or  predictable  like  a  traffic  light.  However  in  the  real- 
world,  clear-cut  situations  like  this  are  not  common.  The  real-world  is  intricate  and  at  times  very 
ambiguous.  The  ability  of  humans  to  discern  subtle  change  across  a  complex  system  with  many 
variables  and  do  so  in  advance  of  a  negative  result  is  very  difficult  and  imperfect.  This  project 
seeks  to  further  develop  this  foresight  in  complex  real-world  systems. 

The  ability  to  detect  change  in  a  stochastic  process  is  a  central  problem  in  statistics  that 
has  great  practical  importance.  Consequently,  robust  solutions  to  this  problem  are  highly  sought 
after  and  widely  used  throughout  the  world.  For  example,  consider  these  four  real-world 
situations  where  change  detection  is  used  to  make  better  decisions  starting  with  a  simplified  one- 
variable  situation: 

-  Univariate  Quality  Control.  A  tire  factory  wants  to  ensure  that  the  tires  it  produces  are 
of  the  highest  quality  and  will  not  degrade  too  quickly.  Quality  control  supervisors  measure 
the  tread  depths  of  tires  rolling  off  the  production  line  and  apply  change  detection  methods  in 
order  to  prevent  tires  from  being  made  improperly.  They  hope  to  detect  subtle,  yet  significant, 
changes  that  foreshadow  imminent  problems  with  tread  depth  production  quality  so  they  can 
be  prevented. 

-  Multivariate  Quality  Control.  Over  time,  the  same  tire  factory  wanting  to  improve  upon 
its  quality  control  method  decides  that  tracking  and  analyzing  tread  depth  measurements  is 
insufficient  to  ensure  quality  tire  production.  Therefore,  in  addition  to  tread  depth,  features 
such  as  sidewall  thickness,  inner  diameter,  and  aspect  ratio  are  also  measured.  These  four 
possibly-related  numbers  taken  together  may  be  compared  to  the  measurements  collected 
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from  other  tires  produced.  Now  it  may  be  possible  to  detect  more  subtle  process  deviations, 
based  on  individual  values  and  relative  values. 

-  Multivariate  Machine  Health.  The  Marine  Corps  wants  to  ensure  its  helicopters  are 
being  maintained  and  repaired  before  accidents  occur.  During  operation,  a  helicopter  vibrates 
due  to  its  mechanical  design  and  aviation  maintenance  experts  determine  that  the  best  way  to 
track  helicopter  health  is  through  tracking  its  operating  vibration  frequency.  Aircraft 
maintainers  record  frequencies  with  respect  to  time  during  flight  at  various  locations 
throughout  the  aircraft.  When  an  aircraft  returns  from  a  flight,  the  recorded  multivariate  data 
is  analyzed  for  subtle  evidence  of  health  degradation  to  determine  if  repairs  are  necessary  to 
prevent  future  mechanical  failure. 

-  Multivariate  Biosurveillance .  Health  officials  want  to  anticipate  (and  possibly  deter) 
disease  outbreaks.  These  situations  represent  significant  changes  from  the  normal  health  of  a 
population  and  so  might  seem  easy  to  detect,  but  the  complex  world  of  human  health  is  home 
to  many  subtle  changes  invisible  to  the  naked  eye.  Doctors  through  the  use  of  change 
detection  methods  are  able  to  monitor  many  variables  on  public  health  at  once  and  detect 
when  subtle  changes  in  certain  measurements  signal  high  risk  of  or  even  foretell  an  outbreak 
of  contagious  disease  and  vaccinate  properly. 

These  few  examples  only  provide  a  quick  glimpse  at  a  small  portion  of  problems  where 
change  detection  is  important:  others  include  image  analysis,  structural  damage  assessment, 
crime  investigation,  and  environmental  field  analysis.  Our  work  offers  a  new  tool  for  change 
detection  that  can  be  employed  in  real-time  in  very  general  multivariate  settings. 

In  the  formal  mathematical  study  of  change  detection,  the  problem  of  detecting  a  change 
in  a  stochastic  process  is  known  as  a  “change-point  problem.”  Each  observation,  X,  of  this 
process  is  said  to  be  drawn  from  an  underlying  probability  distribution,  F .  The  change  point,  X^, 
where  q)  G  (2,  3, ... ,  i]  refers  to  the  first  point  in  a  sequence  of  observations  (Xi,  X2, ... ,  Xj)  at 
which  the  underlying  probability  distribution  for  that  observation  differs  from  that  of  prior 

observations (Fi  =  F2  =  =  F^_i  A  F^). 

Within  change-point  problems,  one  may  differentiate  between  offline  and  online 
problems.  In  offline  problems,  testing  for  change  is  done  only  at  a  predetermined  point  when  the 
system  is  turned  off;  no  new  data  are  collected  as  change-testing  occurs.  Such  an  approach  is 
inadequate  if  a  fatal  change  occurs  in  the  middle  of  a  process.  In  online  problems,  testing  for 


6 


change  occurs  as  data  are  collected.  Generally,  multiple  tests  for  change  are  conducted  in 
sequence  that  as  much  data  about  system  performance  as  possible  is  considered  for  each  test  and 
change  is  detected  in  time  for  the  appropriate  decisions  to  be  made. 

The  nature  of  such  changes  may  be  quite  general.  For  instance,  change  could  occur  as  a 
jump  in  data  distribution  mean  at  one  point  in  time  as  seen  for  a  univariate  example  in  Figure  1. 


•  Observation 

3 

2 

■  •  .  3 . . . V.—; . . 

0 

. .l;j ■  ■  -■  ■  y  •• 

•1 

-3 

20  40 

60  BO  IDO  120  1C  leo  180  200 

Observation  sequence  label,  i 

Figure  1:  Jump  in  distribution  mean  at  point  101  of  200 

But,  change  may  also  occur  as  a  gradual  drift  in  data  distribution  mean,  or  change  may  occur  as  a 
jump  or  gradual  drift  in  other  parameters  of  the  underlying  probability  distribution  such  as 
variance,  scale,  shape,  et  cetera  as  seen  for  another  univariate  example  in  Figure  2. 
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•2 

—  ^ 
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Figure  2:  Drift  in  distribution  variance  at  point  161  of  200 

Simplistic  in  nature,  the  previous  examples  consider  only  a  single  variable.  Figure  3 
shows  five  dimensions  of  200  sequential  observations  on  some  system.  Prior  to  a  change-point 
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at  101,  the  observations  (plotted  in  black)  follow  a  particular  multivariate  distribution;  after  the 
change  point,  the  observations  (plotted  in  red)  follow  a  different  multivariate  distribution.  This 
change  is  by  no  means  obvious  to  the  eye  (in  fact,  a  jump  change  in  the  distribution  mean  of 
magnitude  1  occurs  at  the  change  point);  we  seek  a  test  that  effectively  detects  such  changes  as 
soon  as  possible  after  they  occur. 


Figure  3:  Dimensional  breakdown  of  200  observations  of  5-variate  data 

The  problem  of  change  detection  has  a  long  history.  In  the  relatively  recent  era  of  “big  data,” 
multivariate  methods  are  of  great  interest.  Most  current  approaches  involve  fairly  strong 
assumptions  about  the  distribution  of  the  monitored  observations.  When  the  assumptions  fail  to 
be  met,  these  approaches  often  suffer.  Also,  many  existing  techniques  test  only  for  a  specific 
type  of  change  (such  as  an  abrupt  jump  in  distribution  mean),  or  require  the  change -point  to  be 
pre-determined.  Such  constraints  limit  the  extent  to  which  these  test  may  be  applied  to  real-world 
situations. 

In  this  project,  we  specifically  consider  the  problem  of  detecting  change  in  the  underlying 
distribution  of  a  sequence  of  observations  based  on  monitoring  the  observations  online,  with  the 
dual  objectives  of:  1)  identifying  correctly  that  a  change  has  occurred,  and  2)  maximizing  the 
time  between  when  a  change  is  detected  and  the  time  when  a  negative  result  will  occur  (i.e. 
machine  death  through  mechanical  failure)  or  a  positive  chance  squandered  (not  buying  stock 
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early  enough  to  take  advantage  of  a  rise  in  the  stock  market).  Detected  change  is  only  useful  if  it 
is  detected  in  time  to  be  useful  in  a  decision-making  process. 

We  are  interested  in  finding  a  way  to  detect  change  -perhaps  subtle  change  of  a  quite 
general  nature  -  within  multivariate  data  without  any  a  priori  assumptions  about  underlying 
probability  distributions:  that  is,  ours  is  a  nonparametric  test.  Our  approach  is  graph-theoretic 
and  relies  upon  the  idea  of  matching,  which  involves  pairing  observations  together  based  on 
interpoint  distances.  As  a  starting  point  in  this  project,  we  extend  a  recently  developed  offline 
nonparametric  change  detection  test  formulated  by  Ruth  and  Koyak  (201 1)  for  use  in  online 
situations. 

Our  work  towards  these  ends  is  organized  as  follows:  In  Section  II,  we  classify  sequential 
change-point  problems  into  two  categories,  offline  and  online,  with  discussion  on  how  each  type 
of  problem  relates  to  real-world  scenarios  and  is  framed.  Because  our  new  test  is  an  extension  of 
the  Ensemble  Sum  of  Pair-Maxima  (ESPM)  test  of  Ruth  and  Koyak  (201 1),  we  then  explore  the 
theoretical  underpinnings  of  that  through  a  review  of  relevant  key  literature  and  the  main  tenets 
of  graph-theoretic  matching.  In  Section  III,  we  investigate  the  change  detection  capabilities  of 
the  ESPM  Test  in  multivariate  offline  settings  by  varying  both  the  amount  of  data  available  for 
testing  and  the  location  of  the  change  within  the  data.  This  section  culminates  with  an  extension 
of  the  ESPM  test  to  multivariate  online  situations  using  multiple  testing  theory  and  the 
Benjamini-Hochberg  Procedure  for  controlling  the  Ealse  Discovery  Rate.  Section  IV 
demonstrates  the  performance  of  the  “telescope”  multiple  testing  format  among  various  change 
locations  through  a  simulation  study  built  to  mimic  online  scenarios.  In  Section  V,  we  apply  our 
test  to  the  (simulated)  real-world  data  from  the  2008  Prognostics  and  Health  Management 
Challenge  (PHM)  and  speak  to  the  characteristics  and  challenges  of  real-world  data  sets  in 
regards  to  graph-theoretic  tests.  Of  central  importance  is  the  introduction  of  “horizons”  to  extend 
the  Benjamini-Hochberg  Procedure  to  online  situations.  Section  VI  summarizes  all  our  findings 
and  highlights  opportunities  for  further  work  within  the  field  and  on  the  ESPM  test  in  particular. 
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II.  PROBLEM  BACKGROUND 

A.  PROBLEM  FORMULATION 

Given  a  multivariate  stochastic  process  can  we  detect  departures  from  homogeneity  in 
real  time?  In  other  terms,  can  we  tell  if  the  underlying  probability  distribution  from  which  a 
sequence  of  multivariate  observations  is  drawn  has  changed!  Due  to  the  widespread  and 
unceasing  nature  of  change,  these  same  questions  arise  in  many  real-world  applications  and  are 
commonly  referred  to  as  the  “change-point  problem”  as  previously  stated. 

1.  Change  Points 

In  the  field  of  change  detection,  the  term  change  point  may  be  defined  as  follows:  Given 
a  sequence  of  random  vectors  (X^,  X2, ,  \n)  V  i  G  {1, ... ,  iV},  let  Dj  represent  the  probability 
distribution  of  Xj.  The  change  point  q)  G  {2, ... ,  iV}  is  the  first  point  in  the  sequence  starting  from 
the  left  where  Di  In  our  setting,  once  the  initial  change  occurs,  there  is  no  return  to  the 

original  distribution.  For  example,  the  distribution  change  at  q)  could  be  the  result  of  an  abrupt 
mean  change,  or  “mean  jump”,  where  —  D2  —  ■■■  —  Dfp_^  D^p  =  D(p+-^  —  It  is 

also  possible  that  Dj  varies  with  i  for  i  >  ^:  for  instance,  the  distribution  change  beginning  at  q) 
could  be  a  gradual  mean  change,  or  “mean  drift”,  that  is  implemented  incrementally  and 
described  as  +  z(i  —  ^  -I- 1)  where  p^,  is  the  average  value  of  and  each 

component  of  z  is  the  rate  at  which  the  associated  mean  component  changes.  More  complex 
forms  for  Dj>(p,  like  those  intrinsic  to  real-world  scenarios,  are  allowed  as  well. 

In  a  hypothesis-testing  context,  the  general  change -point  problem  with  respect  to 
observations  X^,  X2, involves  defining  the  null  hypothesis 

(2.1)  Ho-.D^^D2  =  -^Dr, 

against  the  alternative  hypothesis 

(2.2)  D-i  —  D2  —  —  D(p-i  Dfp 

and  Dj>^  ^  0^3  q)  E  {2, ... ,  N]  and  Vj  G  {q), ... ,  N]. 
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As  a  result  of  its  general  nature,  this  alternative  hypothesis  fails  to  be  inclusive  of  every  possible 
change  situation.  Its  framework  is  suitable  for  many  change  situations,  but  does  not  cover  brief 
departures  from  homogeneity  such  as: 

D-i  —  D2  —  ==  D(p_i  —  —  •••  —  Dj^  A 

or  periodic  departures  from  homogeneity  that  cause  there  to  be  a  cyclic  pattern  of  change  away 
from  the  original  distribution  for  an  interval  of  time  and  then  change  back  to  the  original 
distribution  for  an  interval  of  time.  These  scenarios  can  develop  when  performing  image  analysis 
to  detect  short-lived  change  like  a  car  passing  through  an  area  or  a  brief  hand  command  to  a 
robot.  Our  new  test  is  not  designed  to  find  change  in  these  situations.  It  is  designed  to  find 
change  in  situations  where  prior  to  the  change  point  cp  all  observations  come  from  the  same 
initial  distribution  and  then  once  at  the  change  point  all  observations  come  from  distributions 
different  from  the  initial  one. 

2.  Dichotomy  of  Change-Point  Problems 

Various  possible  taxonomies  can  be  used  when  studying  change -point  problems.  Our 
work  naturally  separates  change -point  problems  into  two  main  categories:  online  and  offline. 

This  system  centers  on  the  difference  in  how  new  observations  are  incorporated  into  testing.  For 
online  problems,  data  collection  and  testing  occur  concurrently.  As  new  observations  are  added 
to  the  data  set,  the  null  hypotheses  are  tested  repeatedly  in  the  following  manner: 

1)  With  N  —  1  observations  (X^,  X2, collected  and  available,  add  Xj^ . 

2)  Test  for  a  change-point  ^  G  {2, ... ,  iV} 

a.  If  a  change-point  is  detected,  then  take  appropriate  action. 

b.  If  not,  then  return  to  step  (1)  with  (X^,  X2, ... ,  X;^)  available. 

Classic  cases  of  online  problems  include  the  monitoring  of  operating  mechanical  systems,  where 
system  measurements  believed  to  indicate  system  health  are  sampled  while  the  system  is 
operating  and  tested  in  sequence  to  identify  if  change  has  occurred  in  the  process.  The  goal  is  to 
detect  change  in  time  to  be  useful  in  preventing  a  negative  result  (i.e.  machine  death)  or  a 
positive  opportunity  from  being  squandered  (i.e.  bullish  stock  market).  Because  there  may  be  no 
limit  to  how  large  the  set  of  observations  will  be  if  and  when  a  change-point  is  detected  though. 
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the  set  of  observations  has  the  potential  to  become  prohibitively  large  and  therefore  difficult  to 
work  with  depending  on  the  specific  scenario. 

Alternatively  in  ojfline  problems,  all  data  are  collected  in  advance  of  any  testing;  that  is, 
testing  is  conducted  upon  a  finite  sequence  of  observations  whose  length  is  known  prior  to 
testing.  Observations  are  not  collected  during  the  testing  period  and  vice  versa.  Ojfline  problems 
can  be  further  delineated  into  cases  of  Two  Sample  Tests  or  Simultaneous  Tests.  For  a  Two 
Sample  Test,  the  change-point  q)  G  {2, ... ,  iV}  tested  for  in  the  sequence  of  multivariate 
observations  (Xi,X2,  is  known  or  assumed.  The  sequence  of  observations  is  then  split 

into  two  samples  at  this  predetermined  point,  (X^, ...  ,X^_i]  and  (X^, ...  ,X;^;],  and  then  the 
analyst  tests  the  null  hypothesis  Hq\  —  D2  =  ■■•  =  against  the  alternative  hypothesis 

=  D2  =  ■■■  =  Dfp_^  ^  Dfp  =  =  •••  =  Dj^.  A  well-known  Two  Sample  Test  is  the 

univariate  Kolmogorov-Smirnov  Test  for  Distributional  Homogeneity,  though  it  is  not  normally 
associated  with  change-point  problems.  This  nonparametric  test  regularly  used  by  scientists 
examines  the  maximum  difference  between  the  empirical  distribution  functions  of  associated 
data  sets  in  order  to  determine  if  they  are  both  drawn  from  the  same  distribution.  An  example  of 
this  problem  type  is  seen  in  clinical  trials  where  two  groups  of  subjects  are  drawn  from  the 
general  population.  The  first  is  designated  a  control  group  and  given  a  placebo,  while  the  other  is 
designated  a  test  group  and  administered  a  treatment.  The  groups  are  then  studied  to  determine 
whether  or  not  the  treatment  has  had  some  kind  of  effect.  As  a  whole,  these  tests  are  not  well- 
built  for  real-world  use  beyond  scenarios  where  the  change  point  is  easily  assumed.  If  the 
predetermined  change-point  chosen  by  the  analyst  unknowingly  differs  from  the  unknown  actual 
change-point  by  a  significant  amount  or  by  a  couple  crucial  data  points,  the  test  could  easily 
misidentify  change  when  there  is  none  or  fail  to  detect  when  there  actually  is.  In  real-world 
situations,  the  actual  change-point  is  often  nearly  impossible  to  identify.  The  need  to  make 
assumptions  can  cause  the  effectiveness  of  this  test  to  suffer  heavily.  As  a  consequence,  an 
extremely  attractive  feature  of  the  ESPM  test  is  its  nonparametric  nature.  It  does  not  require  any 
limiting  assumptions  to  be  made  concerning  the  underlying  probability  distribution  of  the  data  or 
the  predetermination  of  a  change  point  for  it  to  be  applied.  This  universality  cements  it  as  a 
strong  foundation  for  solution  approaches  to  numerous  types  of  change-point  problems  including 


ours. 
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For  simultaneous  tests,  there  is  no  predetermined  change-point  q).  Instead,  the  test  is 
performed  upon  the  entire  sequence  of  observations  (Xi,X2,  available;  thus,  all  possible 

change-points  are  tested  simultaneously.  The  null  (2.1)  and  alternative  (2.2)  hypotheses  tested  in 
this  problem  are  as  stated  in  the  previous  subsection.  From  a  utility  standpoint,  the  absence  of  an 
assumption  concerning  change -point  location  makes  these  tests  perfect  candidates  for  application 
to  such  real-world  situations,  where  change  points  are  not  known  in  advance.  One  example  of 
such  a  test  is  environmental  image  analysis.  The  abuse  natural  vegetation  sustains  as  a  result  of 
training  activities  is  a  significant  issue  on  many  military  installations.  Consider  a  large  tract  of 
land  on  a  military  base  monitored  a  by  satellite  specially  designed  for  ground-level  imagery 
where  each  image  taken  of  the  land  represents  a  multivariate  observation  and  each  pixel  within 
an  image  represents  a  variable.  Natural  resource  managers  are  unable  to  measure  key  statistics 
like  land  cover  physically  so  they  rely  on  innovative  environmental  management  and  monitoring 
tools  to  identify  areas  of  land  cover  change  through  the  use  of  satellite  imagery  during  their 
periodic  check. 

As  mentioned  previously  in  Section  I,  the  goal  of  this  research  is  to  extend  the  ESPM  test 
developed  by  Ruth  and  Koyak  (201 1),  which  is  offline,  nonparametric,  and  simultaneous,  for  use 
in  online  problems.  Consequently,  our  research  has  focused  on  conducting  successful 
simultaneous  testing  in  online  situations.  A  survey  of  literature  in  the  field  of  change  detection 
reveals  that  there  are  few  powerful  nonparametric  simultaneous  tests  for  multivariate  change- 
point  problems,  particularly  online  problems.  We  now  proceed  to  review  various  graph-theoretic 
approaches  to  change  detection  concluding  with  a  discussion  about  how  these  approaches 
comprise  the  theoretical  origin  of  the  ESPM  test.  This  will  lay  the  foundation  for  the  new  online 
extension  of  the  ESPM  test  in  the  next  chapter. 

B.  GRAPH-THEORETIC  APPROACHES  TO  CHANGE  DETECTION 

Graph-theoretic  ideas  provide  an  innovative  approach  to  change -point  problems.  Of  late, 
methods  using  graph-theoretic  ideas  such  as  minimum  spanning  trees,  nearest-neighbor 
algorithms,  and  clustering  methods  have  proved  interesting  to  many  due  to  advances  in 
computational  capacity  which  makes  them  realistic  to  implement.  The  ESPM  test  is  itself  based 
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heavily  upon  matching.  To  fully  understand  what  matching  is,  one  must  know  something  about 
the  basic  definitions  and  background  of  graph  theory,  so  that  is  where  we  begin. 

1.  Graph  Theory 

These  definitions  come  from  Chartrand  and  Zhang  (2005).  A  graph  is  an  ordered  pair 
G  =  (y,E)  consisting  of  a  finite  nonempty  set  of  vertices  V  connected  by  edges  e,  which  are 
two-element  unordered  subsets  of  V.  A  graph  Gi  =  (I4,  £'i)  is  called  a  subgraph  of  G  =  (V,  E) 
ifViEV  and  E^  E  E;  if  V-^  =  V,  then  is  a  spanning  subgraph  of  G.  Two  distinct  vertices, 
and  V2,  are  adjacent  vertices  if  they  are  joined  by  an  edge  {v^,  V2}.  Two  distinct  edges  are 
adjacent  edges  if  they  share  a  vertex  such  as  {v^,  V2}  and  (1^2-  A  complete  graph  is  a  graph 
in  which  all  vertices  are  adjacent.  An  undirected  graph  is  one  in  which  the  edges  have  no 
orientation-  that  is,  edge  {v^,  V2}  is  identical  to  {V2,  r’l);  a  directed  graph  is  one  in  which  edges 
have  a  direction  associated  with  them  making  edge  {vj^,  V2}  distinct  from  edge  {V2,  r’l).  Vertex 
and  edge  V2}  can  be  referred  to  as  incident  with  each  other,  and  the  degree  of  vertex 
is  the  number  of  edges  incident  with  . 

Au  —  V  walk  in  graph  G  is  a  sequence  of  vertices  in  G  beginning  with  u  and  ending  with 
V  such  that  consecutive  vertices  within  the  sequence  are  adjacent;  ifu  =  v,  then  the  walk  is 
closed.  A  walk  in  which  no  edge  is  used  more  than  once  is  called  au  —  v  trail.  A  circuit  is  a 
closed  trail  that  includes  at  least  three  distinct  vertices;  a  circuit  that  repeats  no  vertex  except  the 
first  and  last  is  a  cycle.  If  there  is  a  u  —  i;  walk  for  every  pair  of  vertices  in  graph  G,  then  G  is 
said  to  be  connected. 

A  graph  G  is  called  acyclic  if  it  has  no  cycles  and  is  a  tree  if  it  is  both  acyclic  and 
connected.  A  spanning  tree  of  G  is  a  spanning  subgraph  that  is  also  a  tree.  If  a  real  number 
expressing  some  form  of  interpoint  cost  is  assigned  to  each  edge  in  G,  then  G  becomes  a 


Figure  4:  Undirected  complete  graph  on  5  vertices; 
subgraph  highlighted  in  red  is  a  spanning  tree 
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Figure  5:  Directed  graph  with  cycle 
(C,D,E,F)  highlighted  in  blue 

weighted  graph  and  the  sum  of  all  edge  weights  known  as  the  weight  of  the  graph.  The 
spanning  tree  of  weighted  graph  G  whose  weight  is  the  least  among  all  possible  spanning  trees  is 
the  minimum  spanning  tree  (MST)  of  G.  Figures  4  and  5  illustrate  some  of  these  terms.  From 
this  point  forward,  every  graph  discussed  is  a  complete  undirected  graph. 

In  a  pioneering  paper,  Friedman  and  Rafsky  (1979)  considered  various  change-detection 
test  statistics  based  on  the  relational  information  contained  within  MSTs  in  order  to  test  if  two 
samples  came  from  the  same  distribution.  They  begin  with  two  sets  of  observations,  and  S2, 
and  define  V  —  U  ^2 .  Next,  an  MST  is  constructed  with  respect  to  some  interpoint  cost 
function  on  V.  They  then  remove  each  edge  in  the  MST  which  connects  a  point  in  to  a  point  in 
S2,  and  count  the  number  of  disjoint  trees  created  as  a  result  of  edge  removal.  This  count  tends  to 
be  lower  when  and  S2  come  from  different  distributions,  although  this  test  is  not  particularly 
powerful. 

2.  Matching 

To  begin  our  review  of  matching-based  approaches  to  change  detection,  we  present  a  few 
additional  definitions  to  help  develop  later  ideas.  A  subset  of  edges  G  F  is  independent  if  no 
two  edges  in  F^  are  adjacent.  A  matching  in  a  graph  G  =  (V.E)  is  an  independent  set  of  edges 
in  G.  The  amount  of  possible  matches  in  a  graph  depends  upon  its  size.  A  maximum  matching 
in  G  is  a  matching  that  has  at  least  as  many  edges  as  any  other  potential  matching  in  G .  Since  the 
ESPM  test  utilizes  maximum  matchings  and  our  new  test  is  an  extension  of  it,  all  matchings 
discussed  from  this  point  forward  in  the  paper  are  maximum  matchings.  A  perfect  matching  in 
G  is  a  matching  that  includes  every  vertex  in  G.  Perfect  matchings  are  inherently  maximum 
matchings,  although  not  vice  versa;  additionally,  they  are  only  possible  if  a  graph  has  an  even 
number  of  vertices. 
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Matching-based  approaches  have  been  used  in  a  variety  of  problem  areas  such  as  organ 
donation  and  large-scale  logistics.  These  methodologies  commonly  seek  to  find  a  matching 
which  minimizes  some  contextually  relevant  interpoint  cost  function.  In  these  problems,  two 
main  types  of  matching  occur:  bipartite,  where  the  vertices  of  the  graph  are  split  into  two  unique 
subsets  of  observations,  5^  and  ^2,  and  each  edge  of  the  graph  consists  of  a  vertex  from  each 
subset  (a  vertex  can  only  be  paired  with  a  vertex  from  the  other  subset),  and  nonbipartite,  where 
matching  does  not  depend  on  any  previous  partitioning  of  the  vertices  (a  vertex  can  be  paired 
with  any  vertex  other  than  itself).  Each  type  of  matching  has  a  different  effect  upon  which  edges 
make  up  the  minimum- weight  matching. 

For  this  research,  we  take  interest  in  minimum- weight,  or  minimum-cost,  nonbipartite 
matchings  (MNBMs).  The  general  interpoint  cost  function  used  in  change -point  problems  is 
defined  by  Ruth  and  Koyak  (201 1)  as  follows:  Given  sample  space  5,  c:  5  x  5  [0,  oo)  is  a  cost 

function  if  it  satisfies 


(2.3) 

c(x,  x)  =  0  V  X  G  5 

and 

(2.4) 

c(x, y)  =  c(y,  x)  V  X,  y  G  S. 

Let  Cij  denote  the  cost  c(Xj,Xy),  or  in  more  general  terminology  the  weight  of  the  edge 
connecting  Xj  and  Xj.  The  function  c  will  be  referred  to  as  a  distance  function  if  it  satisfies  the 
triangle  inequality 

(2.5)  c(x,  z)  <  c(x,  y)  -I-  c(y,  z)  V  x,  y,  z  E  S 

in  addition  to  (2.3)  and  (2.4).  This  general  definition  gives  the  flexibility  needed  to  accommodate 
all  types  of  data  (discrete  and  continuous,  univariate  and  multivariate,  etc.).  However,  this  does 
not  mean  there  is  no  need  for  adjustment.  Interpoint  cost  functions  should  generally  be  tailored  to 
the  context  of  the  problem  and  the  pertinent  data  types.  In  nonbipartite  minimum  matching,  the 
assignment  of  pairs  relies  entirely  upon  the  choice  of  cost  function.  Context  might  clearly  point 
towards  one  measure  of  cost  over  others;  it  might  require  the  choice  to  become  a  matter  of 
debate.  Commonly  used  cost  functions  include  computing  interpoint  Euclidean  distance  or 

(2.6)  dff  =  JiXi-Xj)'  (Xi-Xj), 

Mahalonobis  distance,  in  order  to  account  for  measurement  scale  and  correlation  between  data 
components 
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(2.7)  =  J(Xi-Xjfs-^(Xi-Xj) 

where  S  is  the  covariance  matrix,  and  Manhattan  distance,  but  there  are  many  others. 

Rosenbaum  (2005)  developed  a  test  statistic  derived  from  MNBMs  to  compare  two 
multivariate  distributions.  Using  simple  interpoint  distances  to  construct  a  MNBM  of 
multivariate  observations,  his  cross-match  statistic  is  the  number  of  pairs  comprised  of  one 
observation  from  the  first  distribution  and  one  observation  from  the  second  distribution. 
Distributions  that  are  very  different  will  cause  few  cross-matches,  whereas  distributions  that  are 
very  similar  will  cause  many  cross-matches.  Impressively,  his  cross-match  statistic  has  a  known 
exact  distribution  and  is  nonparametric.  This  test  strongly  motivated  Ruth  and  Koyak  (2011)  in 
their  development  of  the  ESPM  test. 

In  other  related  work,  Lu  et  al.  (2001)  demonstrate  the  ability  to  use  optimal  nonbipartite 
matching  in  a  multivariate  real-world  scenario  and  achieve  solid  analysis  of  the  problem  in 
question.  Through  an  observational  study  on  the  media  campaign  against  drug  use,  optimal  non¬ 
bipartite  matching  was  used  to  pair  teenage  subjects  who  were  demographically  similar  but  had 
extremely  different  levels  of  exposure  to  the  media  campaign.  The  stated  intentions  of  the 
subjects  in  relation  to  illegal  drug  use  were  used  to  assess  the  effectiveness  of  the  campaign. 

Another  successful  application  of  optimal  nonbipartite  matching  appears  in  Lu  and 
Rosenbaum  (2004)  where  they  investigate  if  a  localized  minimum-wage  increase  is  at  all 
associated  with  depressed  low-wage  employment  rates  in  the  same  area.  Faced  with  the  problem 
of  needing  to  compare  one  test  group  to  two  control  groups,  they  convert  the  natural  tripartite 
problem  into  a  nonbipartite  matching  problem  and  provide  relevant  analysis  on  the  topic  in 
question. 

In  the  vein  of  minimizing  global  interpoint  cost  like  Friedman  and  Rafsky’s  (1979)  MST 
test  and  Rosenbaum’s  (2005)  cross-match  test,  the  cornerstone  of  the  ESPM  test  is  the 
computation  of  a  MNBM  to  pair  observations  so  as  to  minimize  the  sum  of  all  distances  between 
paired  observations.  Unlike  the  others  though,  the  ESPM  test  computes  many  MNBMs.  We 
elaborate  in  the  following  section. 
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3.  Ensembles 

Intuition  suggests  that  if  a  single  MNBM  contains  information  about  distributional 
homogeneity  in  a  sequence  of  observations,  then  additional  information  might  be  contained  in 
additional  matchings  where  each  successive  matching  is  the  next-best  matching  independent  of 
the  first.  This  idea  has  been  thoroughly  explored  by  many  people  and  found  to  be  true.  More 
specifically,  the  power  of  graph-theoretic  tests  is  known  to  be  enhanced  by  considering 
collections  of  orthogonal  subgraphs  called  ensembles.  Friedman  and  Rafsky  (1979)  originally 
suggest  that  ensembles  of  MSTs  with  k  «  N /2  orthogonal  matchings  be  used  to  refine  the 
sensitivity  of  their  multivariate  runs  test  where  N  is  the  number  of  observations  in  the  data  set.  In 
examining  that  same  test,  Friedman  and  Rafsky  (1979)  show  it  has  higher  power  when  used  in 
higher  dimensions,  but  more  importantly  augments  general  test  power  by  computing  their  test 
statistic  on  an  ensemble  of  orthogonal  MSTs,  where  two  MSTs  are  orthogonal  if  they  do  not 
share  any  edges  in  common. 

Although  the  ESPM  test  makes  use  of  a  different  type  of  subgraph  (a  matching),  Ruth 
and  Koyak  (2011)  make  use  of  a  similar  idea  in  order  to  increase  test  power.  To  describe  their 
adaptation,  they  define  the  term  orthogonally  successive  optimal  matchings  (OSOMs)  to  refer  to 
matchings  constructed  through  the  following  process:  compute  an  optimal  nonbipartite  matching, 
then  find  the  next  best  matching  that  is  orthogonal  to  the  first,  then  the  next  best  matching  that  is 
orthogonal  to  both  the  first  and  second,  and  so  on  until  there  are  no  more  orthogonal  matches  left 
to  be  made.  Figure  6  below  displays  OSOMs  on  iV  =  6  points.  Each  color  represents  a  different 


Figure  6:  Orthogonal,  successive,  optimal,  non-bipartite,  maximum  matchings 

on  IV  =  6  points. 


18 


matching  where  the  best  is  blue,  then  red,  green,  purple,  and  black.  Orthogonality  is  shown  in  the 
graph  by  how  no  edge  is  reused  by  two  different  colors.  Given  N  —  2n  observations  and  an 
associated  interpoint  cost  function,  the  OSOM  procedure  guarantees  that  at  least  iV/2  matchings 
may  be  obtained  from  the  data  set. 

At  its  core,  the  ESPM  test  centers  on  the  idea  that  MNBMs  (more  than  one)  based  on 
interpoint  distances  will  tend  to  result  in  pairings  that  are  closer  in  sequence  order  than  would  be 
the  case  if  all  observations  came  from  the  same  distribution.  Until  now  though,  the  application  of 
this  idea  has  been  limited  to  offline  problems.  We  now  propose  a  new  test  that  is  simultaneous, 
multivariate,  and  distribution-free  which  will  extend  the  methodology  of  the  ESPM  test  for  use 
in  online  situations. 
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III.  TEST  DISCUSSION 


We  introduce  a  new  change  detection  test  for  use  in  online  change-point  problems.  As 
stated  previously,  this  new  test  builds  off  the  methodology  of  the  powerful,  nonparametric, 
multivariate,  offline  Ensemble  Sum  of  Pair-Maxima  (ESPM)  test  created  by  Ruth  and  Koyak 
(201 1).  Our  online  extension  applies  Multiple  Testing  Theory  through  a  novel  extension  of  the 
Ealse  Discovery  Rate  procedure  of  Benjamini-Hochberg  (1995). 

Eor  the  purpose  of  presenting  our  test,  assume  a  sequence  of  iV  =  2n  >  4  observations 
ordered  with  respect  to  time.  Our  goal  is  to  test  if  change  occurs  with  respect  to  this  ordering.  Eor 
instance,  change  may  be  a  jump  or  a  drift  in  some  distributional  parameter  at  some  unknown 
point  in  the  sequence.  The  requirement  that  N  be  even  is  not  strict,  but  it  simplifies  description  of 
the  test  and  we  later  describe  how  to  handle  odd  N  accordingly.  Eor  a  simultaneous  test  where  Fj 
is  the  underlying  probability  distribution  of  the  observation  (Xj)  within  this  sequence,  the  null 
hypothesis  of  distributional  homogeneity  asserts  that  Fi=F2  =  -=F„.  The  alternative 
hypothesis  asserts  that  there  exists  a  change  point  cp  E  {2, ...  ,N}  such  that  F^  =  F2  =  •••  = 

F^_i  A  As  in  the  case  of  the  ESPM  test,  we  compute  a  minimum  nonbipartite  matching  of  n 

pairs,  M  =  Xj^],  (Xj^,  Xj^], ... ,  {Xj^_^,  Xj^]|,  with  respect  to  some  cost  function.  The 

ordering  of  these  pairs  is  arbitrary.  We  use  Euclidean  distance  (2.6)  unless  otherwise  specified. 

As  in  Ruth  and  Koyak  (201 1),  let  “  hq-i  “  hq  be  the  sequence  labels  for  the 

pair  |Xj2^_^,  ^hq\-  example,  if  the  second  pair  in  M  is  (Xy,  X3},  then  q  =  2  and  so 

^hq-i  “  ^t3  =  ^3  =  7  and  “  ^4  =  4  =  3.  Now  order  each  individual  pair  as  (Uq,  Eg), 
where  Uq  and  Yq  are  correspondingly  the  minimum  and  maximum  value  of  the  ordering  variable: 

(3.1)  Uq^  min  Fj^  J  and  Yq  =  max  F^^  J,  where  q  G  {1, 2, ... ,  n). 

Continuing  the  q  =  2  example,  U2  =  3  and  Y2  =  7.  These  ideas  are  illustrated  in  the  context  of 
successive  matching  in  Eigure  7  and  Table  1  for  a  case  with  N  =  6  points.  With  this  foundation  in 
place,  we  are  now  ready  to  discuss  the  test  statistics  which  we  will  employ  for  sequential  testing. 
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Figure  7:  Three  successive  minimum  nonbipartite  matchings  on  yv  =  6  points. 


Table  1:  Maximum  and  minimum  values  of  pairs  from  matchings  shown  in  Figure  7 


First  Matching  (Blue) 

Second  Matching  (Red) 

Third  Matching  (Green) 

Pair 

Minimum 

Maximum 

Pair 

Minimum 

Maximum 

Pair 

Minimum 

Maximum 

(1,2)  1  2 

(3.5)  3  5 

(4.6)  4  6 

(1,4)  1  4 

(2,3)  2  3 

(5,6)  5  6 

(1,3)  1  3 

(2,6)  2  6 

(4,5)  4  5 

A.  ENSEMBLE  SUM  OF  PAIR-MAXIMA  TEST 

The  foremost  job  of  any  change  detection  test  is  to  properly  detect  change  regardless  of 
other  concerns.  Thus,  the  decision  of  which  test  statistic  to  use  as  the  determining  factor  of 
whether  or  not  change  has  occurred  is  of  paramount  importance.  A  desirable  test  statistic  is 
proportionally  sensitive  to  the  occurrence,  or  lack  thereof,  of  change;  bad  ones  are  completely 
insensitive  or  overly  sensitive  to  the  change  attempting  to  be  detected.  Our  new  test  is  powered 
by  the  three-level  test  statistic  used  in  the  ESPM  test  where  each  proceeding  level  is  used  to  help 
determine  the  current  level’s  numeric  value. 

1.  The  Sum  of  Pair-Maxima,  T yv 

If  the  alternative  hypothesis  is  true  -  that  is  some  form  of  distributional  change  has 
occurred  within  a  sequence  of  observations  -  it  is  expected  that  a  minimum  nonbipartite 
matching  based  on  interpoint  Euclidean  distances  would  pair  observations  that  are  closer 
together  in  sequence  than  would  be  the  case  under  the  null  hypothesis  of  no  change.  Drawing 
upon  this,  Ruth  and  Koyak  (201 1)  devise  a  test  statistic  based  on  summing  the  differences 


Yi  —  Ui  in  each  pair  and  rejecting  the  null  hypothesis  if  the  sum  is  less  than  some  critical  value. 
They  consider,  Tj^,  an  equivalent  test  statistic  labeled  as  the  Sum  of  Pair-Maxima  (SPM)  test 
statistic  built  upon  this  sum: 
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(3.2) 

This  is  equivalent  to  the  sum  of  sequence  label  differences  seen  by  the  identity: 

(3.3)  ItiTi  =in(2n  +  l)+ilti(Ti-l/i). 

The  null  hypothesis  is  rejected  for  small  values  of  this  sum. 

The  mean  and  variance  of  are  derived  using  Fristedt  and  Gray’s  (2004)  definition  for 
exchangeable  sequences  of  random  variables.  A  sequence  (Yi,  Y2, ... ,  Yj)  of  random  variables  is 
exchangeable  if  for  any  permutation  (3  of  indices  {1, 2, ... ,]}  the  joint  probability  distributions  of 

hi.i'2 . . This  definition  is  easily  fit  by  the  case  of  the  null 

hypothesis  since  the  ordering  among  the  pairs  in  the  matching  is  arbitrary.  There  are  N\  possible 
permutations  of  sequence  labels,  all  of  which  are  equally  probable,  and  the  random  variables 
(Ti,  Y2, ... ,  Yjf)  are  exchangeable.  This  leads  to  the  mean  of  Tj^  being  expressed  as: 

(3.4)  fiN  =E[T^]=n*  E[Yi]  = 
and  the  variance  as: 

(3.5)  =  Var[r„]  =  n  *  Var[Ti]  +  n(n  -  l)Cov[Ti,  T^]  = 

Theorem  1  of  Ruth  and  Koyak  (201 1)  shows  that  Tn  has  a  limiting  normal  distribution: 

(3.6)  P  <  t)  ^  as  N  ^  oo 

\  CTjv  / 

where  t  represents  all  real  numbers  and  0  is  the  standard  normal  cumulative  distribution 
function.  It  is  important  to  note  though  that  this  is  not  done  through  the  classical  central  limit 
theorems.  Despite  each  Tj  in  Tn  having  identical  marginal  distributions  because  they  are 
exchangeable,  they  are  not  independent;  therefore  (3.6)  cannot  be  proved  using  classical  central 
limit  theorems.  To  overcome  this,  (3.6)  is  proved  via  a  technique  known  as  Stein ’s  method. 
Stein’s  method  (Stein,  1972,  1986)  relies  on  a  differential  equation  that  describes  the  normal 
distribution  and  a  process  known  as  “coupling,”  by  which  auxiliary  random  variables  similar  to 


the  variables  being  studied  are  constructed.  The  results  establish  bounds  on  the  distance  from 
normality  for  selected  cases  of  dependency,  including  this  case.  Implementation  of  the  method 
and  proof  of  asymptomatic  normality  for  Tpj  is  seen  in  Section  III  of  Ruth  (2009). 
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The  normal  approximation  for  Tj^  is  improved  using  an  Edgeworth  expansion  to  diminish 
error  present  in  cases  where  n  is  small  or  moderately  sized.  This  procedure  approximates  the 
distribution  of  interest  by  starting  with  a  normal  distribution  and  then  adding  in  higher  order 
corrections  for  non-zero  moments  of  the  third  order  or  above.  In  Ruth  (2009),  the  third  central 
moment  of  Tpj  is  shown  to  be: 

(3.7)  Em  -  EnY]  =  E[T^]  -  -  En  ^  -n(n-l)(2n+l)(2n+3)  ^ 


For  all  n  >  1,  k:3  <  0.  Thus,  it  follows  that  Tj^  is  negatively  skewed  for  all  cases  of  interest.  The 
Edgeworth  approximation  is  then  given  by: 


(3.8) 


~  0(1)  + Co 

\  CTjv  / 


N  +  3 


w7(W-2)(W  +  1) 


(t^  -  l)e 


-t^f2 


where  Cq 


Vis 

126+2n 


0.06.  A  detailed  derivation  is  found  in  Section  III  of  Ruth  (2009). 


Many  theoretical  properties  of  interest  for  a  test  statistic  have  to  do  with  its  properties 
under  the  null  hypothesis.  In  contrast,  a  test  statistic  is  said  to  be  consistent  if  it  has  the  property 
that  its  power  approaches  I  as  sample  size  increases  without  limit  for  any  level  of  significance 
and  any  departure  from  the  null  hypothesis  (no  matter  how  small).  While  the  null  properties  of 
the  tests  such  as  SPM  may  be  straightforward  to  establish,  consistency  is  often  difficult  to  prove. 
The  consistency  of  the  generalized  runs  test  of  Friedman  and  Rafsky  (1979)  that  used  minimum 
spanning  trees  was  only  proven  by  Henze  and  Penrose  twenty  years  after  the  test  was  introduced. 
Rosenbaum’s  cross-match  test  has  only  been  proven  consistent  under  less  general  conditions. 
Without  specification  of  a  change  point,  Ruth  (2009)  showed  that  the  SPM  test  is  consistent 
against  general  jump  alternatives  under  the  same  conditions  for  which  the  cross-match  test  is 
consistent.  Simulations  conducted  during  our  research  suggest  that  the  SPM  statistic  is 
consistent  across  drift  and  other  alternatives  as  well. 


We  stated  earlier  that  the  requirement  for  N  to  be  even  was  not  strict.  If  the  sample  size  is 
odd,  nonbipartite  matching  will  leave  one  observation  unpaired.  In  order  to  overcome  this 
problem  without  leaving  out  the  last  observation,  Ruth  and  Koyak  (201 1)  create  a  dummy 
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observation  which  is  assigned  a  distance  zero  from  every  other  observation;  this  ensures  that 
every  observation  will  be  part  of  a  matching  pair.  These  situations  do  not  affect  the 
asymptomatic  normality  discussed  earlier,  and  the  mean  and  variance  are  given  by: 

(3.9)  liN  =  E[T^]  =  ^2  ^  ^  v(v-ixv+i)_ 

We  consider  a  short  graphical  example  of  the  Tj^  test  statistic  for  illustration  purposes. 
Figure  8  below  shows  a  randomly  generated  sample  of  iV  =  20  bivariate  observations  which  are 
plotted  by  their  sequence  numbers  and  then  summarized  further  below  in  Table  2.  By  design,  we 
drew  these  observations  from  unequal  distributions.  From  the  SPM  test,  we  calculate  Tpj  =  122. 
The  null  hypothesis  tells  us  that  the  expected  value  and  standard  deviation  of  Tj^,  given  by  (3.4) 

and  (3.5),  are  =  140,  aj^  =  =  6.48.  For  an  a  =  0.05  level  test,  the 

critical  value  stemming  from  (3.8)  is  129.  =  122  <  129  suggesting  that  the  underlying 

probability  distributions  of  the  sequential  observations  displayed  in  Figure  8  have  undergone 
some  sort  of  change,  as  indeed  is  the  case. 


Figure  8:  A  minimum  nonbipartite  matching  on  N  =  20  bivariate  observations. 
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Table  2:  SPM  test 
calculations  for  data 
shown  in  Figure  5 


Pair 

Yi, 

Maximum 

(1,5) 

5 

(2,9) 

9 

(3,7) 

7 

(4,8) 

8 

(6,13) 

13 

(10,11) 

11 

(12,14) 

14 

(15,17) 

17 

(16,18) 

18 

(19,20) 

n 

20 

'  Yi  =  122 

i  = 

2.  Cumulative  Sum  of  Pair-Maxima,  S yv,fc 

The  power  of  the  SPM  test  is  significantly  increased  through  the  use  of  a  collection  of 
orthogonal  matchings,  or  “ensemble”.  This  relies  upon  the  idea  that  if  a  single  matching  provides 
information  about  distributional  homogeneity  in  a  sequence  of  observations,  then  subsequent 
optimal  pairwise  orthogonal  matchings  on  the  same  observations  will  provide  additional 
information.  A  k-ensemble  of  matchings  is  considered  to  be  complete  if  /c  =  iV  —  1  orthogonal 
matchings  of  the  data  are  possible.  Complete  ensembles  have  many  practical  applications  such  as 
being  solutions  to  round  robin  scheduling  problems  for  a  sports  tournament. 

Ruth  and  Koyak  (201 1)  construct  these  ensembles  recursively;  that  is  they  compute  the 
first  MNBM  on  the  data,  then  compute  the  next  best  MNBM  on  the  data  that  is  pairwise 
orthogonal  to  the  first,  and  so  on  until  no  more  MNBMs  are  possible.  This  procedure  can  often 
fail  to  produce  a  complete  ensemble  of  MNBMs,  but  Anderson  (1972)  shows  that  it  will  always 
yield  at  least  a  half  ensemble  (k  =  iV/2),  which  is  sufficient  for  this  test. 

Each  of  these  matchings  has  a  sum  of  pair- maxima  statistic.  Let  Tn.  i  signify  the  sum  of 
pair- maxima  statistic  for  the  best  successive  orthogonal  matching.  The  n-ensemble  version  of 

the  SPM  test  is  then  expressed  as: 
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(3.10)  V/c  =  Sr=i7’N,i. 

We  define  as  the  cumulative  sum  of  pair-maxima  over  the  first  k  matchings.  Just  as  Tn  takes 
on  lower  values  under  the  alternative  hypothesis  than  under  the  null  hypothesis,  is  also 
expected  to  drop  below  its  mean  value  when  under  alternative  hypotheses.  Theorem  2  of  Ruth 
and  Koyak  (2011)  shows  that  each  Tn,  i  has  identical  univariate  marginal  distributions  and 
moments  up  to  the  second  order  when  under  the  null  hypothesis  of  homogeneity.  It  is  natural 
then  that: 


(3.11) 

and 


~  —  Zf=i£'[7w,t]  ■  —  k  *  fipj  —  k 


Cov(5„,^,5„,^)  =  (^)  (l  -  4. 


N(N  +  1) 


where  l<i;<w<iV  —  1  and  =  iV(iV  -I-  l)(iV  —  1)^/180.  An  expression  for  the  variance 
of  is  desired  in  order  to  find  its  exact  distribution,  but  hard  to  determine  due  to  its 
dependence  upon  the  covariance  between  Tj^  i,  which  is  also  difficult  to  determine  analytically. 
Simulation  has  suggested  that  Cov(Tj^  i,  =  CovQTj^  -j^,  7)v  2)  for  i  ^  A  so  for  the  sake  of 
analysis  this  is  assumed  to  be  true.  Under  this  assumption,  it  follows  that: 

(3.12)  =  Var[SNk\  =  - 


180 


where  the  underscore-tilde  denotes  that  the  equality  depends  upon  the  covariance  assumption. 


3.  Ensemble  Sum  of  Pair-Maxima,  Kjsj 

The  asymptotic  normality  of  Tj^  suggests  that  is  also  asymptotically  normal. 
Furthermore,  the  covariance  structure  of  is  the  same  as  that  of  a  Brownian  bridge.  This  leads 
to  a  choice  of  test  statistic: 


(3.13)  Kj^  —  max;jg(i_2 . w/2)  (  ^  cjv 

where  from  (3.11),  =  (iV  - 


The  exact  distribution  of  is  known  for  normal  Ruth  and  Koyak  (201 1)  call  this  the 
Ensemble  Sum  of  Pair-Maxima  (ESPM)  statistic.  They  demonstrate  through  simulation  that  ^N,k 
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does  not  appear  to  be  asymptomatically  normal  for  low  dimensions.  As  a  result,  the  exact  null 
distribution  of  Kpj  remains  unknown  and  is  an  open  problem.  Fortunately,  simulation-based 
critical  values  from  Ruth  and  Koyak  (201 1)  enable  the  use  of  Kpj  as  a  test  statistic. 

These  approximated  critical  values  for  Kpj  are  varied  for  N,  a,  and  the  dimensionality  of 
the  observations  being  tested.  They  can  be  found  in  Table  2  of  Ruth  and  Koyak  (201 1). 

4.  ESPM  Performance  Characteristics 

Part  of  our  proposed  work  was  to  independently  verify  ESPM  performance 
characteristics  (Table  3)  as  published  in  Ruth  (2009).  The  original  simulations  suggest  this  test  is 
powerful  against  a  wide-range  of  change-point  alternatives.  A  total  of  1000  samples  were 
generated  for  each  of  30  vignettes  which  vary  by  dimensionality  (p  =  5,  20),  type  of  change 
(jump  or  linear  drift),  multivariate  distribution  type  (normal,  normal  mixture,  or  Weibull),  the 
parameter  0  affected  by  change  (mean  vector,  covariance  matrix,  or  scale  parameter),  and  lastly, 
the  magnitude  of  change  (A).  In  each  vignette,  the  true  change  point  is  located  atk=  101.  For 
each  vignette  in  which  a  mean-change  takes  place,  all  samples  begin  with  a  mean  vector  of  zero 
and  end  with  the  mean  vector  having  magnitude  A,  either  as  a  jump  or  linear  drift  starting  at  the 
change  point.  For  vignettes  in  which  the  covariance  matrix  or  scale  parameter  changes,  all 
samples  start  out  with  unit  parameters  and  end  with  the  parameter  multiplied  by  a  value  of  (1  -i- 
A),  but  only  in  the  first  variate.  Multivariate  normal  mixtures  generate  observations  with  a  zero 
mean  and  an  identity  covariance  matrix  with  a  probability  0.9,  and  observations  with  a  zero 
mean  and  16  times  the  identity  covariance  matrix  with  probability  0.1.  Multivariate  Weibull 
vectors  are  comprise  of  independent,  univariate  Weibulls  with  shape  parameters  =1.5  and  scale 
parameters  =  1.  All  simulations  were  conducted  with  a  nominal  test  level  of  a  =  0.05;  thus, 
power  when  A  =  0  should  be  within  simulation  error  of  the  nominal  test  level.  Two-sided 
confidence  intervals  are  computed  as  in  Devore  (2004). 

Our  results  successfully  recreate  the  published  performance  characteristics,  and  in  some 
cases  (highlighted  in  yellow),  strongly  suggest  that  the  true  power  level  of  the  ESPM  test  may  be 
higher  than  previously  thought.  This  reaffirms  the  impressive  nature  of  graph-theoretic  tests  that 
use  ensembles  to  elicit  information  from  data  and  assigns  great  potential  value  to  any  effort 
seeking  to  bring  these  procedures  into  widespread  usage  in  modern  applications. 
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Table  3:  Verification  of  simulated  powers  for  ESPM  test  under  different  distributions 
and  change  scenarios  based  on  a  total  of  1000  simulations,  a  =  0,05,  N  =  200,  change 
point  k  -  101,/?  =  dimensionality.  All  confidence  tests  are  at  95%  level. 


Jump 

Drift 

R/K 

Tested 

Low 

High 

Tested 

A 

Value 

Value 

Cl 

Cl 

R/K  Value 

Value 

Low  Cl 

High  Cl 

Multivariate  not 

mal,  9  =  mean 

p  =  5 

0 

0.04 

0.057 

0.044 

0.073 

0.06 

0.064 

0.050 

0.081 

0.5 

0.60 

0.624 

0.594 

0.653 

0.27 

0.280 

0.253 

0.309 

1 

1.00 

0.999 

0.994 

1.000 

0.84 

0.853 

0.830 

0.874 

Multivariate  normal,  0  =  mean. 

p  =  20 

0 

0.05 

0.062 

0.049 

0.079 

0.05 

0.064 

0.050 

0.081 

0.5 

0.33 

0.352 

0.323 

0.382 

0.13 

0.135 

0.115 

0.158 

1 

0.95 

0.976 

0.965 

0.984 

0.56 

0.585 

0.554 

0.615 

Multivariate  normal,  t 

=  covariance  matrix,  p  = 

5 

0 

0.05 

0.062 

0.049 

0.079 

0.05 

0.057 

0.044 

0.073 

0.5 

0.97 

0.973 

0.961 

0.981 

0.52 

0.537 

0.506 

0.568 

1 

1.00 

1.000 

0.996 

1.000 

1.00 

0.999 

0.994 

1.000 

Multivariate  normal  mixture,  9  =  mean,  p  =  5 

0 

0.04 

0.064 

0.050 

0.081 

0.06 

0.057 

0.044 

0.073 

0.5 

0.56 

0.548 

0.517 

0.579 

0.21 

0.239 

0.214 

0.267  *** 

1 

0.99 

0.997 

0.991 

0.999 

0.76 

0.783 

0.756 

0.807  *** 

Multivariate  Weibull, 

0  =  scale  parameter,  p  = 

5 

0 

0.06 

0.064 

0.050 

0.081 

0.05 

0.057 

0.044 

0.073 

0.5  *** 

0.70 

0.771 

0.744 

0.796 

0.35 

0.426 

0.396 

0.457  *** 

1 

0.99 

0.996 

0.990 

0.998 

0.86 

0.901 

0.881 

0.918  *** 

***lndicate  situations  simulated  where  power  level  exceeds  published  power  levels  with 

statistically  significant  difference. 

With  the  ESPM  test  thoroughly  explored,  we  now  transition  over  to  new  work  to  extend 


the  ESPM  test  for  use  in  online  situations. 
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B.  MULTIPLE  TESTING  THEORY 

In  our  Introduction,  we  stated  that  the  goals  behind  our  new  test  are  to  identify  change 
correctly  when  it  occurs  and  detect  that  change  as  quickly  possible  after  it  does  occur.  The  first 
goal  is  one  common  to  all  change  detection  tests;  the  second  though  comes  about  as  a  result  of 
our  focus  on  online  situations  where  testing  occurs  as  data  is  acquired.  To  detect  change  as 
quickly  as  possible,  we  make  use  of  multiple  testing  theory  and  seek  to  test  every  time  new 
observations  are  added  to  the  data  instead  of  waiting  for  all  observations  to  be  collected.  A 
sample  of  N  observations  allows  for  as  many  as  iV  —  3  sequential  tests  in  order  to  detect  any 
possible  change  up  until  all  N  observations  have  been  collected  given  that  the  ESPM  test 
assumes  there  are  at  least  four  observations  to  start  with  when  first  testing.  Although  this  extra 
testing  allows  for  potential  early  warning,  it  raises  two  questions:  1)  how  do  we  maintain  overall 
test  level  and  control  the  Type  I  error  rate  and  2)  what  data  ought  to  be  included  in  each  test  as 
new  observations  arrive? 

Depending  on  the  size  of  N,  this  approach  can  obviously  lead  to  a  high  number  of 
hypothesis  tests  and  a  problematic  increase  in  Type  I  Error  as  seen  in  Table  4.  Eor  example,  the 
monitoring  of  a  certain  machine  may  result  in  performing  200  separate  hypotheses  tests  to  find 
change.  Through  the  use  of  a  standard  test  level  a  =  0.05,  it  is  expected  that  10  tests  will  be 
deemed  “significant”  and  flag  positive  for  change  even  when  the  null  hypothesis,  Hq,  is  actually 
true.  In  general  form,  there  is  probability  a  of  a  Type  I  error  and  complementarity  probability 
(1  —  a)  that  a  false  positive  is  avoided  in  any  given  test.  It  follows  that  if  m  independent 
hypothesis  tests  are  performed,  there  is  probability  (1  —  a)  of  no  false  positives  in  m  tests. 


Table  4:  Breakdown  of  possible  outcomes  in  hypothesis 
and  their  respective  probabilities  of  occurrence. 


'' 

fype  I  and  II  Errors 

^edsion^^^^^ 

Actual  Condition 

Hq  True 

Hq  False 

Do  Not  Reject  Hq 

Correct  Decision 
(1-  a) 

Incorrect  Decision 
Type  II  Error 

P 

Do  Reject  Hq 

Incorrect  Decision 
Type  I  Error 

a 

Correct  Decision 
(1- 

a  =  P(Type  I  Error)  and  p  =  P(Type  II  Error) 
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Figure  9:  Type  I  error  rate  as  the  number  of  independent  hypotheses  (m) 

under  test  increases. 

Thus,  a  probability  1  —  (1  —  a)  of  making  at  least  one  Type  I  error  exists  when  m 
independent  tests  are  performed.  Figure  9  demonstrates  how  quickly  the  Type  I  error  rate,  or 
chance  of  encountering  at  least  one  false  positive,  rises  as  m  is  increased. 

Table  5  summarizes  the  situation  in  a  traditional  form  (see  for  example  Benjamini  and 
Hochberg  (1995)).  There  are  m  hypotheses  to  be  tested,  of  which  tuq  are  true  for  the  null 
hypothesis;  R  serves  conversely  as  the  number  of  hypotheses  rejected.  R  is  an  observable 
random  variable,  whereas  U,  V,  S,  and  T  are  unobservable  random  variables.  V  is  the  number  of 
Type  I  errors.  If  each  null  hypothesis  is  tested  individually  at  level  a,  then  R  increases  as  a 
increases.  Use  of  equivalent  lower  case  letters  is  done  to  signify  the  realized  values  of  the  same 
variables. 


Table  5:  Number  of  errors  committed  when  testing  m  hypotheses 


Declared 

nonsignificant 

Declared  significant 

Total 

True  null  hypotheses 

U 

V 

ruo 

Non-true  null  hypotheses 

T 

s 

m  -  tuq 

m  -  R 

R 

m 
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There  are  various  methods  for  controlling  the  Type  I  error  rate  and  maintaining  overall 
test  level.  In  terms  of  the  random  variables  from  Table  5,  the  per-comparison  error  rate  (PCER) 
is  the  expected  value  of  Type  I  errors  to  the  number  of  tested  hypotheses;  that  is,  PCER  = 
^[Vj/m  .  Testing  each  hypothesis  individually  at  level  a/m  guarantees  that  ^[Vj/m  <  a, 
which  means  that  PCER  attempts  to  minimize,  but  not  eliminate  Type  I  errors.  The  family-wise 
error  rate  (EWER)  is  the  probability  of  at  least  one  Type  I  error  occurring  during  testing,  where 
EWER  =  P(V  >  1).  Testing  each  hypothesis  individually  at  level  a  guarantees  that  (V  >  1)  < 
a  .  This  method  specifically  seeks  to  guard  against  even  one  Type  I  error  occurring.  Easily,  the 
false  discovery  rate  (EDR)  is  the  expected  proportion  of  Type  I  errors  to  the  number  of  rejected 
hypotheses.  Unlike  many  classical  approaches  which  control  EWER  in  the  strong  sense,  our  new 
test  will  use  EDR  to  control  the  Type  I  error  rate,  which  we  will  discuss  in  the  following  section. 

In  offline  situations,  testing  is  naturally  done  upon  all  the  data  collected  as  it  would  be 
detrimental  to  test  efficacy  to  leave  any  observations  out.  In  online  situations,  however,  there  is  a 
decision  to  be  made  concerning  which  available  data  to  include  in  each  test.  We  investigated  two 
techniques  for  applying  multiple  testing  to  online  change  detection  situations  before  ultimately 
moving  forward  with  one. 

Consider  a  sample  i?  of  iV  d-dimensional  observations.  In  overlap  testing,  the  first  test  of 
the  sequence  is  a  subgroup  of  /  observations  spanning  from  indices  I  to  /.  Eor  each  subsequent 
test,  both  the  front  and  back  index  shift  by  a  value  of  s  to  incorporate  the  newest  available  data 
without  the  subgroup  being  tested  growing  beyond  /  observations  in  size.  In  telescope  testing, 
the  first  test  of  the  sequence  is  again  a  subgroup  of  size  /  spanning  from  indices  I  to  /.  In  this 
case  though,  for  each  subsequent  test  only  the  larger  index  shifts  by  a  value  of  s  to  incorporate 
the  newest  data  causing  the  tested  subgroup  to  grow  by  s  observations  with  every  test.  The 
subgroup  of  the  final  test  has  a  size  of  N  and  is  equal  to  R.  Our  new  test  uses  telescope  testing. 

The  answers  to  both  of  the  questions  we  posed  at  the  beginning  of  this  section  directly 
affect  the  nature  of  our  online  extension  to  the  ESPM  test  and,  in  turn,  ultimately  determine  its 
viability.  We  will  elaborate  further  on  the  theory  and  motivation  behind  these  decisions. 
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1.  False  Discovery  Rate 

Here  we  follow  the  groundbreaking  work  of  Benjamini  and  Hochberg  (1995).  FDR  is  the 
expected  value  of  the  proportion  of  rejected  null  hypotheses  that  are  falsely  rejected.  This 
proportion  may  be  expressed  by  the  random  variable  Q  =  V  /  (V  +  S).  When  V  +  S  =  0,  we 
define  Q  =  0,  as  in  false  rejection  cannot  occur  if  no  null  hypothesis  is  rejected  in  the  first  place. 
Q  is  an  unobserved  random  variable  because  it  is  impossible  to  exactly  know  the  values  of  v  and 
s,  and  therefore  q  —  v/(v  +  s),  even  after  all  experimentation  and  data  analysis  is  completed. 
The  FDR  is  defined  as  the  expected  value  of  Q: 

(3.14)  FDR  =  E[Q]=  E[V/R]  =  F[V/(V  +  S)]. 

There  are  many  properties  of  this  error  rate,  but  two  specific  ones  are  particularly 
fundamental,  and  coincidentally  easy  to  show.  Consider  again  a  situation  where  m  hypotheses 
are  being  tested: 

1)  If  all  null  hypotheses  are  true,  the  FDR  is  equal  to  FWER.  In  this  case,  s  =  0  and 

V  =  r  if  any  tests  are  declared  significant,  so  if  v  =  0  then  Q  =  0,  and  if  v  >  0  then 
Q  =  1,  which  results  in  P(V  >  1)  =  £'[Q].  Thus,  control  of  the  FDR  is  also  in  fact 
control  of  the  FWER  in  an  indirect  and  less  effective  manner. 

2)  In  situations  where  tUq  <  m,  the  EDR  is  less  than  or  equal  to  the  EWER.  In  these 
cases,  if  i;  >  0  then  v Jr  <  1,  and  P(V  >  1)  >  P[Q].  Therefore,  any  procedure  that 
controls  the  FWER  intrinsically  controls  the  EDR.  Yet,  any  procedure  that  controls 
the  EDR  only  will  likely  be  less  stringent  and,  in  turn,  a  gain  in  power  will  result. 

This  becomes  especially  apparent  the  more  non-true  null  hypotheses  exist  in  the  data. 
S  tends  to  be  larger  and  so  does  the  difference  between  the  error  rates;  ergo  the 
potential  for  increases  in  power  is  larger  the  closer  mo/m  is  to  1. 

To  control  the  random  variable  Q  at  each  hypothesis  declared  significant  would  be 
optimal.  This  is  impossible  though.  If  tUq  =  m  and  even  a  single  null  hypothesis  is  rejected,  then 
v/r  =  1  and  Q  cannot  be  controlled  because  every  rejection  is  false.  Adding  in  the  extra 
condition  (V/R  |  R  >  0)  does  not  alleviate  the  problem,  and  neither  does  P[V/R|R>0]. 
Instead,  EDR  equally  expressed  as  P[  V/R  |  R  >  0]  ■  P(R  >  0),  which  is  possible  to  control. 
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The  dominant  influence  for  the  method  by  which  we  control  FDR  in  our  new  test  is  the 
famous  Benjamini-Hochberg  Procedure  (1995),  which  for  independent  test  statistics  and  any 
arrangement  of  non-true  null  hypotheses  impressively  controls  FDR  at  a*  via  a  linear  adjustment 
of  test  levels.  Consider  testing  m  hypotheses  (/fi,  H2, ... ,  with  corresponding  p-values 
(Pi, P2, ... , Pm)-  Let  P(-i)  <  P(-2)  <  •••  <  P(yn)  be  the  ordered  p-values  and  Hj-q  the  null  hypothesis 
corresponding  to  Pity  They  prove  that  the  following  Bonferroni-type  multiple  testing  decision 
rule  controls  FDR  at  overall  test  level  a*\ 

(3.15)  let  k  be  the  largest  i  for  which  Pj-q  <  ^a*; 

then  reject  all  //(q  for  i  —  (1,2, ... ,  k}. 

2.  Extension  of  Benjamini-Hochberg  Procedure  to  Online  Scenarios 

Although  the  Benjamini-Hochberg  Procedure  (1995)  controls  the  FDR,  it  is  designed  for 
use  in  offline  situations  only.  Three  of  its  basic  assumptions  make  it  unable  to  work  in  online 
scenarios  in  its  current  form.  First,  the  procedure  assumes  that  all  hypotheses  have  been  tested 
and  have  corresponding  p-values  before  its  implementation.  In  an  online  problem,  where  testing 
is  ongoing  as  new  observations  are  collected,  this  caveat  would  force  the  user  to  wait  until  all 
data  was  collected.  Second,  the  procedure  as  it  stands  requires  the  number  of  tested  hypotheses 
to  be  determined  prior  to  its  use.  Online  scenarios,  by  contrast,  involve  ongoing  processes  for 
which  the  end  is  not  explicitly  known.  For  example,  a  mechanical  system  being  observed  might 
be  a  “lemon”  and  experience  system  failure  soon  after  it  starts  being  used  and  tested,  or  it  could 
give  a  decade  of  faithful  uninterrupted  service  and  provide  a  lot  of  observations  for  testing.  This 
makes  it  nearly  impossible  to  determine  how  many  tests  are  going  to  be  conducted  prior  to  use. 
Thirdly,  Benjamin!  and  Hochberg  (1995)  stipulate  that  each  hypothesis  must  be  independent 
from  the  others.  It  turns  out  that  in  many  real-world  applications,  dependency  exists  among  the 
tested  hypotheses.  We  will  now  proceed  to  discuss  our  solutions  to  these  problems  before  giving 
an  exact  definition  of  our  procedure  extending  control  of  the  FDR  to  online  situations. 

Because  online  problems  involve  incremental  accumulation  of  data,  one  may  only  test 
some  subset  of  all  null  hypotheses  at  any  given  time.  As  a  conservative  approach,  we  treat  all 
untested  hypotheses  as  if  they  had  been  tested,  with  p-values  all  equal  to  1.  Once  pertinent  data 
are  available  to  test  a  yet-untested  hypothesis,  we  replace  the  “placeholder”  p -value  of  1  with  the 
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actual  value  and  then  order  all  the  p-values.  The  key  to  our  proeedure  is  the  faet  that  the 
Benjamini-Hoehberg  Proeedure  evaluates  ordered  p-values,  and  our  plaeeholding  seheme 
simply  ensures  that  that  the  p-values  assoeiated  with  untested  hypotheses  are  guaranteed  to  be 
last  in  order.  We  outline  this  proeedure  in  detail  shortly. 

To  eombat  the  problem  of  uneertain  endpoints,  we  ehoose  an  observation  horizon  to 
determine  the  eorrect  linear  adjustment  of  test  levels  prior  to  applying  the  procedure.  Based  on 
the  typical  operating  profile  of  the  system  being  studied,  this  horizon  H  is  set  at  N  observations 
and  allows  all  observations  less  than  or  equal  to  N  in  sequence  label  to  be  available  for  testing. 
Any  observation  greater  than  N  in  sequence  label  goes  untested  as  seen  in  Figure  10,  where  the 
horizon  for  a  general  meehanieal  system  is  set  at  150  observations,  but  the  machine  lasts  for  200 
observations  eausing  the  last  50  to  go  untested.  In  general,  H  ought  to  be  set  to  include  the  entire 
range  of  interest  for  possible  change  deteetion,  but  not  be  excessively  large  (whieh  ean  reduee 
test  power).  We  discuss  this  //-selection  issue  more  in  our  results  section. 

The  solution  for  the  problem  of  dependeney  among  hypotheses  eomes  from  Benjamin!  and 
Yekutieli  (2001).  Realizing  that  dependent  test  statisties  are  eneountered  often  when  trying  to 
eontrol  the  FDR  in  praetiee,  they  develop  a  new  proeedure  that  eontrols  the  FDR  for  positively 
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Figure  10:  Horizon  H  set  at  150  observations  for  a  machine  that  happens  to 

live  for  200  observations. 
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dependent  test  statistics.  For  other  cases  of  dependency,  they  prove  that  the  same  procedure  can 
be  easily  modified  to  control  the  FDR,  but  the  resulting  procedure  is  more  conservative.  Since 
the  exact  distribution  of  Kpj  is  not  known,  we  are  unable  to  formally  prove  that  our  test  statistic 
meets  the  dependency  conditions  set  forward  by  Benjamini  and  Yekutieli  (2001)  in  order  to  use 
their  procedure.  But,  the  simulation  study  we  conduct  in  Section  IV  points  to  the  result  that  Kpj 
meets  these  conditions  because  test  efficacy  under  the  alternative  hypothesis  is  not  unreasonably 
suppressed  compared  to  the  efficacy  of  the  original  offline  test  whereas  the  opposite  would  be 
expected  if  Kpj  did  not  meet  the  conditions.  With  nothing  suggesting  so  far  that  it  violates  these 
dependency  conditions  in  any  important  way,  we  make  use  of  the  procedure  in  our  new  test. 

3.  Online  Extension  Procedure 

Consider  testing  m  independent  true  null  hypotheses  (Hi,  H2,  ■■■,  with  corresponding 
p-values  (Pi,  P2, ... ,  Prn)  on  those  observations.  Let  P(-i)  <  P(^2)  ^  ^  P{m)  be  the  ordered  p- 

values  and  H(j)  the  null  hypothesis  corresponding  to  P(j)  and  let  P(^)  =  (P(i),  P(2),  — ,  P{m))- 
Suppose  these  p-values  are  realized  in  sequence;  that  is,  at  step  j  only  p-values  Pi  through  Pj  are 
known.  We  propose  the  following  on-line  testing  procedure  for  each  step  j  G  (1, ... ,  m}  for  m 
sequential  observations: 

(1)  Let  a*  be  the  desired  overall  test  level,  and  let  Qq  =  (1,  1, ... ,  1). 

(2)  Let  Qy  =  (Qi,  Q2, ... ,  Qj,  1, 1, ...  ,1)  for  1<]  <m,  where  Qi  ^  Pi  Vi  <  j  . 

(3)  Let  Qq)  =  ((2(1),  (2(2), ... ,  (2(;),  1, 1, ...  ,1)  were  (2(0  is  the  ordered  value  in  Qj. 

(3.16)  -  If  there  exists  some  k  <  j  and  some  i  <  j  such  that  (2/c  =  (2(t)  ^  ^  then  reject 

and  declare  that  some  change  has  occurred  in  the  distribution  sequence. 

-  If  (2(0  ^  —  J’  back  to  step  (2)  with  ]  =  ]  +  1  until  j  =  m. 

That  this  procedure  controls  FWER  at  the  desired  level  is  established  in  the  following  result: 

Theorem:  For  m  independent  test  statistics  and  m  independent  true  null  hypotheses  in  online 
scenarios,  the  procedure  (3.16)  controls  the  FWER  at  level  a* . 

Proof:  Eet  V  be  the  number  of  Type  I  errors  in  the  offline  Benjamini-Hochberg  test.  Eet  Vj  be 
the  number  of  Type  I  errors  through  step  j.  By  Benjamini  and  Hochberg  (1995),  P(F  >  1)  <  a*. 
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We  must  show  that  for  any  j  G  {1, ... ,  m},  P(l4  >  1  or  1^2  >  1  or  ...  or  Vj  >  l)  <  a*.  So,  choose 
j  G  {1, ... ,  m]  and  notice  that  Ki,  V2, ...  ,Vj  is  a  monotone  non-decreasing  sequence.  Therefore, 
P(l4  >  1  or  1^2  >  1  or  ...  or  V,-  >  l)  =  P(V,-  >  l).  Next,  observe  that 

(3.17)  P(Vj>l)=P  (Oa)  or  0(a  <  ^  or ...  or  Q,,)  <  • 

Now  recognize  that  P(^)  <  Qy)  (where  the  “<  "  relationship  is  taken  element-wise),  so 

/  a*  2a*  3a*  ja*\ 

P  (2(1)  <  —  or  (2(2)  < -  or  (2(3)  < - or  ...  or  (2())  <  — 

\  ^  ^  m  ^  ^  m  ^  ^  m  m  J 

(  a*  2a*  3a*  ja*  \ 

-  ^  V  -  m  ^  ^  ^ 

(3.18)  =P(T>l)<a*. 

Therefore,  P(V)  >  l)  <  a*  for  all  j  G  (1,  ...,m}.  ■ 

Figure  1 1  illustrates  a  linear  adjustment  of  test  levels  created  using  the  online  extension 
procedure  for  a  scenario  with  five  hypotheses  and  a  desired  overall  test  level  a*  =  0.05. 


Figure  11:  FWER  Linear  Adjustment  for  5  Tests,  a*  =  0. 05 
4.  Hypothesis  Selection  Procedure 

With  more  tests  being  conducted  due  to  multiple  testing,  more  thought  must  be  put 
forward  in  regards  to  the  structuring  of  the  hypotheses  being  tested.  A  faulty  hypothesis  structure 
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can  rule  ineffective  any  and  all  testing  that  is  conducted.  If  tested  subgroups  are  too  small  or  skip 
over  too  much  data,  then  detection  power  will  suffer.  Each  pair  of  new  observations  added  to  a 
data  set  can  substantially  change  the  MNBMs  computed  in  calculating  Kpj;  it  is  not  necessary  to 
wait  for  large  groups  of  observations  to  be  acquired  in  between  tests.  So  we  seek  to  reflect  that  in 
our  chosen  methodology. 

Our  new  test  selects  hypotheses  for  testing  through  what  we  call  telescope  testing.  We 
initially  wait  for  a  small  group  of  observations  to  be  available  for  the  first  test  which  utilizes  all 
data  present  and  then  proceed  to  test  on  all  available  data  each  time  observations  are  added  to  the 
data  set.  As  a  result,  the  testing  region  grows  in  size  like  a  telescope  with  each  subsequent 
hypothesis  being  tested  including  slightly  more  data  than  the  previous  hypothesis.  For  example, 
consider  an  online  change-point  problem  with  the  horizon  set  at  200  observations  where  up  to  m 
hypotheses  (Hi,  H2, ... ,  may  be  tested.  The  initial  hypothesis,  is  tested  when  there  are 
20  observations  available  and  every  ensuing  hypothesis,  is  tested  each  time  two 
observations  are  added  to  the  data  set,  so  hypothesis,  H(i+j)<^,  is  tested  on  20  +  2i  observations. 
The  final  hypothesis,  is  tested  upon  200  observations.  This  selection  of  hypotheses  creates 
the  need  for  m  =  91  hypotheses  to  be  considered  and  so  the  linear  adjustment  of  the  FDR  is 
constructed  to  handle  up  to  and  including  91  tests  as  seen  Figure  12. 


Figure  12:  FDR  Linear  Adjustment  for  91  Tests,  a*  =  0. 05 
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With  our  new  test  fully  explained,  we  will  now  proceed  to  demonstrate  its  performance 
through  two  studies.  The  first  uses  simulated  data  to  investigate  the  power  and  advance  warning 
capabilities  of  our  test  while  varying  the  change -point,  magnitude  of  change,  and  dimensionality 
The  other  occurs  on  pseudo  real-world  data  and  allows  us  to  see  whether  or  not  our  test  can 
potentially  handle  the  rigors  of  powerfully  detecting  change,  and  detecting  it  early,  in  real-world 
scenarios. 
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IV.  TEST  PERFORMANCE 

The  new  test  presented  in  this  paper  is  only  useful  if  it  detects  change  reasonably  well- 
ahead  of  when  it  would  be  detected  in  an  offline  situation.  In  this  section,  we  present  two 
different  investigations  into  the  performance  characteristics  of  our  test.  The  first  is  a  simulation 
study  that  examines  the  power  of  our  new  test  and  its  capability  to  detect  change  ahead  of  the 
ESPM  test.  The  second  is  an  application  of  our  test  to  simulated  real-world  data  from  the 
International  Conference  on  Prognostics  and  Health  Management  Data  Challenge  in  2008. 

The  two  primary  performance  metrics  that  we  use  in  both  studies  are  detection  power, 
where  power  is  defined  as  the  probability  of  rejecting  the  null  hypothesis  when  it  is  false,  and 
advanced  warning,  which  is  the  number  of  observations  prior  to  test  detected  change.  Interpoint 
cost  is  determined  by  the  Euclidean  distance  function  given  by  (2.6). 

A.  SIMULATION  STUDY 

1.  Methodology 

Here  we  simulate  a  variety  of  change -point  scenarios  to  quantify  the  power  of  this  new 
test  and  its  ability  to  provide  advance  warning  of  change  unavailable  in  offline  testing.  Each 
power  estimate  in  the  tables  and  figures  of  this  section  is  the  fraction  of  times  the  new  test 
detected  change  under  the  given  conditions  based  on  1000  simulations.  Each  advanced  warning 
estimate  is  the  average  of  the  number  of  observations  prior  to  the  simulated  end  of  life  at  which 
change  is  detected.  If  no  change  is  detected,  advanced  warning  is  zero  for  the  purposes  of 
computing  the  mean.  In  every  case,  total  sample  size  is  N  =  200,  the  test  significance  level  is 
a  =  0.05,  and  sample  space  is  The  choice  of  iV  =  200  is  based  on  the  desire  to  directly 
compare  the  performance  of  our  online  test  to  the  original  offline  form  of  the  ESPM  test,  while 
concurrently  avoiding  extremely  long  computation  times  for  larger  optimal  nonbipartite 
matchings.  This  is  equivalent  to  setting  a  horizon  at  200  observations. 

Observations  were  simulated  using  RStudio  each  observation  is  drawn  from  a  standard 
five- variate  normal  distribution  with  density  function  of  the  form: 

(■t.l)  ►■.d)  =  V  X  e  R=, 
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where  (j,  G  and  S  G  are  the  distribution  mean  and  covariance  matrix  respectively.  The 
pre-change-point  data  for  this  multivariate  normal  case  have  p,  =  0  and  I.  =  where  is  the 
5x5  identity  matrix.  The  post-change-point  data  have  a  different  mean  or  scale  as  specified  later 
in  this  section. 

We  use  three  different  formats  to  select  the  way  in  which  tests  are  conducted.  In  the  first, 
we  wait  until  there  are  20  observations  to  first  test,  and  then  proceed  to  test  every  time  two 
additional  observations  are  added  to  the  data.  So,  we  test  at  20  observations,  22,  24,  26. . .  all  the 
way  until  200  (“20-2”).  Under  this  format,  there  is  a  possibility  for  up  to  91  tests  to  be  conducted 
contingent  on  if  change  is  detected  prior  to  the  last  test.  The  second  format  again  begins  testing 
at  20  observations  and  then  tests  every  time  an  additional  20  observations  are  added  to  the  data 
(“20-20”).  So,  we  test  at  20,  40,  60,. . .,  200.  Here  there  can  be  up  to  10  tests.  The  third  and  final 
format  begins  testing  at  40  observations  and  tests  for  every  40  additional  observations  after  that 
(“40-40”).  So,  we  test  at  40,  80,  120,. . .,  200  and  there  can  be  a  total  of  5  tests.  By  only  testing 
when  there  is  an  even  number  of  observations  within  all  three  formats,  we  maintain  a  situation 
conducive  to  nonbipartite  matching  without  the  need  to  create  dummy  points. 

In  this  study,  we  consider  various  change -points.  For  our  purposes,  change  location 
fraction  c*  is  defined  as  where  cp  is  the  change  point  as  before.  Notice  that  the  value  of  the 

denominator  is  the  sample  size  N,  not  the  size  of  any  specific  tested  subgroup.  We  examine 
values  of  0.10,  0.25,  0.335,  0.50,  0.665,  0.75,  and  0.90  for  c*,  which  corresponds  to  change  at 
observations  21,  51,  67,  101,  133,  151,  or  181. 

For  each  of  these  values  of  c*,  we  consider  change  magnitudes  A  of  0,  0.5,  and  1.0.  When 
A  =  0,  the  null  hypothesis  is  true  and  the  power  estimate  is  an  estimate  of  the  test’s  Type  I  error 
rate.  In  ideal  situations,  the  Type  I  error  rate  is  equal  to  the  chosen  significance  level.  When  A  > 

0,  A  indicates  the  total  magnitude  of  change.  We  only  look  at  jump  changes,  where  A  is  the 
magnitude  of  the  abrupt  change  that  occurs  at  the  designated  change  point. 

Our  simulations  only  look  at  changes  in  distribution  mean.  Without  loss  of  generality, 
this  change  is  implemented  in  the  first  component  of  each  observation  only  in  the  following  form: 

(4.2)  Xj  ~  Fi^ypj,  i  <  (p 

Xj  +  (A,  0, 0,  0,  0)~  i  >  (p 
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due  to  the  rotational  invariance  of  F^vn-  Thus,  the  post-change-point  data  for  the  multivariate 
normal  case  have  \i  —  (A,  0,0, 0,0)'  and  1  —  I s- 

2.  Performance  Results 

Figures  13-15  and  Tables  6-7  display  power  estimates  for  our  new  test  at  a  significance 
level  of  a  =  0.05.  The  estimated  realized  significance  level  for  the  test  with  N  =  200  is 
ccactuai  —  0.04435  giving  credence  to  our  new  method’s  ability  to  control  the  overall  test  level. 
Critical  values  for  the  test  statistic  are  determined  from  the  simulations  run  by  Ruth  (2009). 


As  expected,  we  observe  a  distinct  improvement  in  power  from  the  A  =  0.5  case  to  the  A 
=  1  case.  Also  it  is  known  that  ESPM  test  power  is  degraded  when  the  change  point  is  in  the  tails 
of  the  tested  subgroup  (away  from  the  middle)  the  test  suffered  somewhat,  since  fewer  pre-  or 
post-change  data  (depending  on  the  location)  are  present  to  indicate  a  change.  Our  results  again 
generally  show  this  to  be  true.  For  the  A  =  1  case,  the  average  powers  corresponding  to  c*  values 
of  0.25,  0.335,  0.5,  0.665,  and  0.75  are  at  or  above  75%  for  all  three  testing  formats  whereas  both 
tails  fall  below  that.  Comparing  the  two,  power  at  the  left  tail  where  c*  =  0.1  exceeds  that  at  the 
right  tail  where  c*  =  0.9.  This  is  likely  because  when  c*  =  0.1  the  change  point  remains  in  the 
tested  subgroup  for  nearly  all  tests  conducted,  while  when  c*  —  0.9  post-change  data  does  not 
enter  the  tested  subgroup  until  the  last  few  tests  making  change  difficult  to  detect.  For  the  20-20 
and  40-40  formats,  power  at  c*  =  0.1  and  c*  =  0.9  is  higher  than  seen  in  the  20-2  format.  This 
effect  appears  to  result  from  more  stringent  step-wise  adjustment  of  critical  values  for  the  20-2 
format.  With  the  potential  for  up  to  91  tests  taking  place,  each  step  of  the  adjustment  is  smaller 

(cc*  2.(x* 

— , ...  1  is  more  restrictive  than  for  the  first  steps  of  the  other  formats  (20-20:  — ,  — ,  — , ... 

91  /  ^  ^  10  10  10 


cc*  3cc* 

and  40-40:  — ,  — , ...).  For  the  A  =  0.5  case,  the  same  general  relationship  exists  between 

formats  and  values  of  c*  except  power  estimates  are  much  lower  with  highs  of  40-50% 
depending  on  the  format.  The  highest  power  estimates  from  the  new  online  test  in  the  A  =  1 
(~95-100%)  case  are  essentially  equivalent  to  offline  power  estimates  for  the  ESPM  test 
(~100%),  but  this  similarity  does  not  hold  for  the  A  =  0.5  case  where  the  online  test  loses  some 
power  likely  due  to  the  use  of  step-wise  adjusted  significance  levels  (40%-50%  for  online  test; 
-60%  for  ESPM  test). 
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Average  power  for  simulated  data 
(200  observations;  start  20,  grow  by  2) 


Figure  13:  Average  power  for  testing  format  of  start  20,  grow  by  2, 

across  values  of  c* 


Average  power  for  simulated  data 
(200  observations;  start  20,  grow  by  20) 


Figure  14:  Average  power  for  testing  format  of  start  20,  grow  by  20, 

across  values  of  c* 
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Average  power  for  simulated  data 
(200  observations;  start  40,  grow  by  40) 


Figure  15:  Average  power  for  testing  format  of  start  40,  grow  by  40, 

across  values  of  c* 


Table  6:  Average  simulated  power  across  different  testing 
formats  and  values  of  c*  at  change  magnitude  A  =  0,5 

Multivariate  Normal,  0=mean,  cl=5,  Z\  =  0.5,N=H=200 

Start  (Grow) 


20  (20) 

40  (40) 

20(2) 

0.10 

0.209 

0.179 

0.127 

0.25 

0.394 

0.404 

0.347 

0.335 

0.487 

0.465 

0.411 

0.50 

0.515 

0.499 

0.411 

0.665 

0.307 

0.360 

0.248 

0.75 

0.199 

0.215 

0.146 

0.90 

0.080 

0.083 

0.034 

***Numbers  in  parentheses  represent  telescope  growth  value. 
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Table  7:  Average  simulated  power  across  different  testing 
formats  and  values  of  c*  at  change  magnitude  A  =  1,0 

Multivariate  Normal,  0=mean,  d=5,  A  =  1.0,  N=H=200 

Start  (Grow) 


20  (20) 

40  (40) 

20(2) 

0.10 

0.635 

0.628 

0.499 

0.25 

0.978 

0.974 

0.963 

0.335 

0.995 

0.994 

0.984 

0.50 

0.998 

0.996 

0.990 

0.665 

0.974 

0.975 

0.956 

0.75 

0.851 

0.870 

0.759 

0.90 

0.160 

0.190 

0.085 

***Numbers  in  parentheses  represent  telescope  growth  value. 


Figures  16-18  and  Tables  8-9  present  advance  warning  estimates.  Each  of  the  three 
formats  displays  similar  ability  to  provide  advanced  warning.  The  only  noticeable  differences 
arise  at  the  tails  and  are  very  minor  other  than  the  A  =  1,  c*  =  0.1  case  which  is  a  result  of  the 
reduced  power  of  the  20-2  format  in  that  situation.  We  attribute  the  minor  differences  to  a 
combination  of  natural  variation  and  dissimilarity  between  the  step-wise  growth  and  selection  of 
hypotheses  in  each  testing  format.  Each  warning  estimate  is  also  slightly  affected  by  the  amount 
of  warning  they  can  provide  based  on  the  corresponding  change  location  fraction,  or  c*;  early 
change  locations  are  inherently  able  to  give  more  warning  than  later  ones.  This  appears  to 
override  the  increased  power  found  at  c*  —  0.335, 0.5,  and  0.665.  In  the  A  =  1  case,  the 
warning  estimate  at  c*  =  0.25  completely  overshadows  the  warning  estimates  for  c*  =  0.5  and 
c*  =  0.665,  and  has  a  visually  apparent  difference  over  c*  =  0.335.  Eor  the  A  =  0.5  case,  these 
differences  are  far  less  pronounced,  but  still  there.  The  best  average  advanced  warning  times  of 
this  case  are  about  30  observations  for  all  formats,  while  in  the  right  tail  of  all  formats  the 
estimates  fall  below  20  observations.  Upon  observation,  the  shape  of  the  warning  estimates 
demonstrates  resemblance  to  the  shape  of  the  power  estimates.  This  is  not  surprising  because  the 
more  often  that  change  is  detected,  the  greater  average  advanced  warning  should  be.  Despite  this 
similarity  between  the  shapes  of  the  power  estimates  and  advanced  warning  estimates  though, 
the  maximum  advance  warning  estimate  is  not  found  at  c*  =  0.5.  All  three  formats  provide  the 
best  advanced  warning  when  the  change  point  is  near,  not  at,  the  middle  of  the  first  half  of 
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observations,  giving  the  advanced  warning  distribution  a  positive  skew.  Notice  in  particular  that 
for  c*  =  0.25,  corresponding  to  a  change  point  at  observation  51,  all  three  formats  provide 
approximately  100  observations  of  advanced  warning.  In  other  words,  the  online  test  provides 
indication  that  a  change  has  occurred  with  only  100  observations  available  for  testing,  while  the 
corresponding  offline  test  has  to  “wait”  for  200  observations  to  be  available  for  testing.  This 
example  highlights  the  value  added  by  our  online  test. 


Figure  17 :  Average  advanced  warning  for  testing  format  of  start  20,  grow  by  20, 

across  values  of  c*  and  A 
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Figure  18:  Average  advanced  warning  for  testing  format  of  start  40,  grow  by  40, 

across  values  of  c*  and  A 


Table  8:  Average  advanced  warning  across  different  testing 
formats  and  values  of  c*  at  change  magnitude  A  =  0,5 

Multivariate  Normal,  Q=mean,  cl=5,  A  =  0.5,N=H=200 

Start  (Grow) 


20  (20) 

40  (40) 

20(2) 

0.10 

22.5 

17.3 

11.6 

0.25 

30.8 

30.1 

27.0 

0.335 

31.6 

27.9 

26.5 

0.50 

23.2 

15.4 

16.8 

0.665 

9.0 

8.0 

6.5 

0.75 

7.1 

5.9 

4.6 

0.90 

7.5 

5.5 

2.3 

***Numbers  in  parentheses  represent  telescope  growth  value. 
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Table  9:  Average  advanced  warning  across  different  testing 
formats  and  values  of  c*  at  change  magnitude  A  =  1,0 

Multivariate  Normal,  0=mean,  d=5,  A  =  1.0,  N=H=200 

Start  (Grow) 


20  (20) 

40  (40) 

20(2) 

0.10 

82.2 

77.3 

57.6 

0.25 

105.4 

97.6 

101.5 

0.335 

92.5 

80.5 

92.4 

0.50 

60.3 

49.2 

62.1 

0.665 

27.2 

20.9 

28.5 

0.75 

15.9 

7.5 

14.3 

0.90 

6.5 

6.1 

3.5 

***Numbers  in  parentheses  represent  telescope  growth  value. 


B.  PHM  DATA 

1.  Methodology 

Having  identified  some  of  the  strengths  and  weaknesses  of  our  new  test  through 
simulation,  we  now  turn  our  attention  to  performance  in  real-world  scenarios.  In  such  cases,  data 
are  usually  not  independent  they  rarely  come  from  well-understood  distributions,  and  change 
points  are  not  inserted  by  hand  at  predetermined  locations.  To  apply  our  test  to  real-world  data, 
we  use  data  from  the  International  Conference  on  Prognostics  and  Health  Management  Data 
Challenge  in  2008.  The  data  set  consists  of  218  separate  multivariate  time  series  each  from  a 
different  instance  of  the  same  complex  engineering  system,  referred  to  as  a  “unit”  (e.g.,  the  data 
could  be  from  a  fleet  of  ships  of  the  same  type).  Every  unit  starts  with  different  degrees  of  initial 
wear  and  manufacturing  variation  which  are  unknown  to  the  user.  The  wear  and  variation  are 
considered  typical,  not  part  of  a  fault  condition.  There  are  three  operational  settings  which  have  a 
substantial  effect  on  unit  performance.  For  every  observation  of  each  unit,  there  are  three 
operational  settings  and  21  sensor  measurements.  As  seen  in  the  real-world,  the  data  is 
contaminated  with  sensor  noise. 

At  the  beginning  of  each  time  series,  the  unit  is  operating  normally  and  then  develops  a 
fault  at  some  point  during  the  series.  This  fault  grows  in  magnitude  until  system  failure  occurs  at 
the  last  observation  in  the  series.  Our  goal  is  to  detect  that  a  change  has  occurred  and  do  so  prior 
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to  the  last  observation  available  for  that  unit.  Some  units  have  as  few  as  128  observations  and 
some  have  as  many  as  357  observations.  Each  “power  fraction”  in  the  tables  and  figures  of  this 
section  is  the  number  of  units  for  which  change  was  detected  prior  to  system  failure.  Advanced 
warning  estimates  are  the  average  number  of  observations  prior  to  the  last  one  in  the  series  when 
change  is  detected.  Although  some  units  have  more  observations  than  others  and  can  provide 
more  warning,  we  treat  all  equally  when  computing  the  mean  advanced  warning. 

Due  to  the  real-world  nature  of  this  data,  certain  challenges  arise  when  trying  to  apply  our 
test.  Control  variables  like  the  ones  present  in  the  PHM  data  cause  multiple  levels  of  normal 
operation  as  seen  in  Figure  19.  The  large  differences  between  these  levels  can  cause 


Example  Subset  of  PHM  Data  (Unit  1,  Sensor  3) 


Figure  19:  Subset  of  PHM  data  displaying  various  levels  of 
normal  operation  corresponding  to  different  control  settings 


graph-theoretic  tests  and  their  respective  cost  functions  to  signal  change  caused  by  changes  in 
operational  setting:  clearly  it  is  undesirable  to  signal  unwanted  change  in  response  to  such 
expected  changes  in  response  variables.  To  overcome  this  challenge,  one  might  may  choose  to 
center  and/or  scale  the  data  based  on  control  variable  information.  For  our  study,  we  choose  to 
center,  but  not  scale,  the  PHM  data.  We  averaged  the  first  ten  observations  at  each  setting  of  the 
first  control  variable  within  every  unit  as  a  typical  baseline  for  normal  operation.  Then,  the 


48 


baseline  for  each  setting  was  subtracted  from  every  observation  with  the  corresponding 
operational  setting. 

Another  potential  problem  is  autocorrelation.  In  the  independent  and  identically 
distributed  case,  the  value  of  a  current  observation  does  not  depend  at  all  on  previous 
observations.  However,  the  real-world  often  fails  to  operate  in  this  way.  Observations  made  on 
the  same  system  close  in  time  to  each  other  are  naturally  predisposed  to  be  close  to  each  other, 
and  therefore  in  our  test  be  paired  with  each  other.  For  example,  consider  an  area  whose 
geographic  location  inclines  it  to  experience  many  slowly  developing  and  moving  low  pressure 
systems  in  the  atmosphere,  and  these  systems  cause  persistence  to  daily  rainfall.  The  daily 
weather  in  this  area  is  a  result  of  the  past  behavior  of  these  systems  and  foreshadows  the  next 
day’s  weather  as  well. 

In  this  study,  we  use  only  the  20-2  format  to  conduct  tests  as  it  is  the  format  that  most 
immediately  processes  newly  available  data.  The  20-20  and  40-40  formats  demonstrated  slightly 
better  performance  characteristics  than  the  20-2  format  in  computer  simulations,  if  the  real-world 
application  of  interest  can  accommodate  those  lower  resolution  formats,  then  they  should  be 
considered  for  use  as  well.  Because  each  unit  has  a  different  number  of  observations,  each  one 
has  a  different  number  of  tests  that  can  be  conducted  on  it.  As  such,  we  set  the  horizon  at  a 
variety  of  levels  consistent  with  possible  machine  lifespans  in  order  to  determine  the  proper  step¬ 
wise  adjustment. 

In  contrast  to  our  first  study,  we  do  not  know  the  type  of  change  being  implemented  upon 
the  system,  or  even  when  it  occurs. 

2.  Performance  Results 

Figure  20  displays  the  power  achieved  when  the  overall  test  level,  a*,  is  fixed  at  0.05 
with  varied  horizon  settings  and  is  representative  of  the  graph  seen  for  other  values  of  a*.  The 
lowest  power  appears  in  the  left  tail  where  H  =  125.  From  there,  the  graph  sharply  rises  to  a 
power  fraction  close  to  1.0.  This  demonstrates  that  if  the  selected  horizon  is  sufficiently  long  and 
within  the  typical  operating  profile  of  the  machine,  that  is  to  say  not  overly  long,  then  detection 
of  change  prior  to  system  failure  is  virtually  certain.  If  the  selected  horizon  is  too  short,  then 
change  detection  power  is  adversely  affected.  Figure  21  provides  another  perspective  on  the 
power  fraction,  where  its  values  are  shown  for  different  horizon  settings  graphed  and  different 
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test  levels.  The  perspective  reinforces  the  property  seen  in  Figure  20  that  if  the  horizon  setting  is 
sufficiently  long  and  within  the  typical  operating  profile  of  the  machine,  then  the  power  fraction 
will  remain  near  or  at  its  maximum  for  all  reasonable  values  of  a*. 


Figure  20:  Fraction  of  units  for  which  change  is  detected  before 
failure  by  new  test  on  PHM  data  for  varying  horizon  settings 


Figure  21:  Average  power  fraction  given  by  new  test  on  PHM 
data  when  varying  the  alpha  level  for  different  horizon  settings 
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Figure  22  presents  a  three-dimensional  look  at  the  advanced  warning  provided  when  both 
horizon  settings  and  the  overall  test  level  is  varied.  It  demonstrates  that  choosing  a  horizon 
setting  that  is  too  short  will  not  only  reduce  power,  but  also  severely  cut  down  on  advance 
warning  regardless  of  the  overall  test  level.  Similar  to  graphs  of  the  power  fraction,  a  plateau 
appears  as  the  horizon  setting  is  increased,  which  further  suggests  that  it  is  better  to  use  an 
excessively  long  horizon  within  the  operating  profile  rather  than  a  short  one.  Within  this  plateau 
though,  there  is  a  distinct  ridge  which  sits  above  all  other  horizon  settings  as  a*  is  varied.  This 
suggests  an  optimal  horizon  setting  for  these  machines  and  that  similar  analysis  might  be 
performed  for  other  scenarios  to  determine  appropriate  optimal  horizon  settings  for  such  cases. 


Figure  22:  Average  advance  warning  given  by  new  test  on  PHM  data 
when  varying  both  the  horizon  setting  and  overall  test  level 


As  a  whole,  with  the  power  fraction  near  or  at  100%  for  optimal  horizon  settings 
regardless  of  a*  and  advanced  warning  of  at  least  75  observations  for  the  worst  and  1 10  for  the 
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best  combination  of  a*  and  horizon  setting,  this  online  extension  shows  great  potential  for 
successful  application  in  real-world  scenarios. 
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V.  CONCLUSIONS  &  OPPORTUNITIES  FOR  FUTURE  WORK 

In  this  paper,  we  build  upon  the  offline  ESPM  change-detection  test  of  Ruth  and  Koyak 
(2011)  and  introduce  a  new  multiple  testing-based  approach  to  online  multidimensional  change- 
point  problems.  This  approach  results  in  an  effective  online  change  detection  procedure  and 
portends  great  promise  for  future  real-world  applications.  Our  review  of  the  field  of  change 
detection  shows  this  to  be  an  active  area  of  research  with  many  potential  applications  and  that 
multivariate  nonparametric  approaches,  let  alone  online  approaches,  are  few.  Most  existing 
approaches  require  restrictive  distributional  assumptions  which  often  limit  real-world 
applicability. 

The  online  extension  we  propose  satisfies  two  primary  requirements  of  multiple  testing:  1) 
maintaining  overall  test  level  across  many  true  null  hypotheses  and  2)  achieving  reasonable 
power  against  false  null  hypotheses.  Assuming  that  the  dependency  conditions  of  Benjamini  and 
Yekutieli  (2001)  are  met,  our  test  meets  the  first  requirement  by  making  use  of  the  Benjamini- 
Hochberg  Procedure  (1995)  to  set  a  step-wise  adjustment  of  test  levels  based  on  the  number  of 
hypotheses  being  tested  and  the  desired  overall  test  level.  This  procedure  is  designed  to  control 
false  discovery  rate,  although  in  the  setting  of  our  interest  it  also  controls  family-wise  error  rate. 
To  address  the  second  requirement,  we  employ  an  existing,  powerful,  offline  test  -  the  ESPM 
test  -  and  introduce  a  method  of  ingesting  and  testing  data  incrementally  which  we  refer  to  as 
‘telescope’  testing,  where  the  testing  region  begins  with  an  initial  sequence  of  observations  and 
then  grows  in  size  as  new  observations  are  added  to  the  data  set  and  incorporated  into  the  testing 
region.  This  approach  demonstrates  that  our  proposed  test  maintains  the  desired  overall  test 
level  while  achieving  impressive  power  and  useful  advanced  warning  times  in  many  scenarios. 
Eurthermore,  our  method  of  extending  offline  tests  to  online  is  not  limited  to  extensions  of  the 
ESPM  test  only,  and  so  we  believe  this  has  promise  for  adapting  other  powerful  offline 
techniques  to  online  scenarios. 

With  this  project  coming  to  its  conclusion,  the  body  of  work  invites  several  possibilities 
for  further  exploration  and  extension.  One  opportunity  consists  of  finding  an  exact  null 
distribution  for  the  ESPM  test  statistic,  Kpj.  While  our  simulation  study  and  PHM  data  study 
demonstrate  the  efficacy  of  this  test,  the  absence  of  a  known  null  distribution  limits  the 
widespread  use  of  this  statistic.  Two  possible  acceptable  alternatives  to  finding  an  exact 
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distribution  would  be  1)  finding  a  result  which  bounds  tail  distribution  probabilities  for  Kj^,  or  2) 
finding  permutation-test  type  results  that  could  be  applied  to  the  ESPM  test  (and  thus  directly 
extended  to  our  test). 

Another  opportunity  lies  in  theoretical  work  in  finding  faster  solutions  to  the  optimal 
non-bipartite  matching  problem,  which  is  an  active  area  of  optimization  research.  For  an 
ensemble  test  that  requires  N/2  MNBMs  to  be  computed,  the  problem  of  speed  is  very  real. 
While  MNBMs  may  be  found  in  polynomial  time,  they  are  found  far  less  efficiently  than  other 
minimum-weight  subgraphs  mentioned  in  this  paper  such  MSTs  and  bipartite  matchings.  Ruth 
(2009)  reports  run  times  for  MNBM  ensembles  created  using  Derigs’  (1998)  algorithm  on  the 
order  of  N*;  our  experiences  confirm  this  to  be  true.  This  is  problematic  for  an  online  test  like 
ours  in  cases  where  N  is  very  large  and  data  are  sampled  at  a  very  high  rate.  More  efficient, 
possibly  even  suboptimal,  matching  techniques  would  alleviate  this  issue;  therefore,  it  would  be 
useful  to  study  possible  benefits  and  costs  of  using  computationally  cheaper  graph-theoretic 
approaches. 

Also,  our  review  of  change  detection  methods  briefly  mentions  the  use  of  different  cost 
functions.  Often  the  sample  space  of  interest  naturally  suggests  some  appropriate  dissimilarity 
metric  and  analysis  proceeds.  The  ESPM  test  makes  use  of  Euclidean  distance  as  a  measure  of 
dissimilarity,  but  a  different  cost  function  will  affect  detection  power  against  alternate 
hypotheses  when  using  ensemble  methods.  For  real-world  applications,  it  is  worth  examining 
which  cost  functions  lead  to  the  most  desirable  power  and  advanced  warning  characteristics  for 
the  case  on  hand. 

Finally,  since  our  real-world-type  scenario  used  PHM  data  from  an  international 
contest  for  statisticians  and  engineers  in  2008,  our  results  regarding  choice  of  window  size  are 
only  strictly  applicable  to  units  of  the  type  in  that  contest  (although  broad  observations  from  that 
study  likely  transfer  to  other  scenarios).  It  would  be  interesting  and  useful  to  apply  our  approach 
to  other  real-world  scenarios.  Areas  such  as  image  analysis,  machine  health  diagnosis  and 
prognosis,  biosurveillance,  and  quality  control  have  readily  available  data  and  provide  excellent 
opportunities  for  future  work. 
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GLOSSARY 

Acyclic  Graph  -  a  graph  that  has  no  cycles. 

Adjacent  Edges  -  two  distinct  edges  that  share  a  vertex  such  as  {v^,  V2]  and  {v2,  v^}. 

Adjacent  Vertices  -  two  distinct  vertices,  and  V2,  that  are  joined  by  an  edge  V2}. 

Bipartite  Matching  -  a  matching  where  the  vertices  are  split  into  two  unique  subsets,  5^  and  S2, 
and  each  edge  included  must  contain  a  vertex  from  each  subset. 

Circuit  -  a  closed  trail  that  includes  at  least  three  distinct  vertices. 

Closed  Walk  -  a  sequence  of  vertices  in  graph  G  beginning  with  u  and  ending  with  v  such  that 
consecutive  vertices  within  the  sequence  are  adjacent  and  u  =  v. 

Complete  Graph  -  a  graph  G  in  which  all  vertices  are  adjacent. 

Connected  Graph  -  a  graph  G  in  which  there  is  au  —  v  walk  for  every  pair  of  vertices. 

Cycle  -  a  circuit  that  repeats  no  vertex  except  for  the  first  one  equaling  the  last. 

Degree  -  the  number  of  edges  incident  with  vertex 

Directed  Graph  -  a  graph  G  in  which  edges  have  a  direction  associated  with  them  making  edge 
{^i>  ^2}  distinct  from  edge  {V2,  r’l). 

Graph  -  an  ordered  pair  G  =  (V,  E)  consisting  of  a  finite  nonempty  set  of  vertices  V  connected 
by  edges  e,  which  are  two-element  unordered  subsets  of  V. 

Graph  Weight  -  the  sum  of  all  edge  weights  in  a  weighted  graph  G. 

Incident  -  a  vertex  and  an  edge  that  meet  such  as  vertex  and  edge  {v^,  V2]. 

Independent  Edges  -  a  subset  of  edges  G  E  where  no  two  edges  are  adjacent. 

Matching  -  an  independent  set  of  edges  in  a  graph  G. 

Maximum  Matching  -  a  matching  that  has  at  least  as  many  edges  as  any  other  potential 
matching  in  G. 

Minimum  Spanning  Tree  -  the  spanning  tree  of  the  weighted  graph  G  whose  weight  is  the  least 
among  all  possible  spanning  trees. 

Nonhipartite  Matching  -  a  matching  where  each  edge  included  does  not  depend  on  any 
previous  partitioning  of  the  vertices  (a  vertex  can  be  paired  with  any  vertex  other  than  itself). 
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Perfect  Matching  -  a  matching  that  includes  every  vertex  in  G. 

Subgraph  -  a  graph  =  (Ki,  E-^)  of  G  =  (V,  E)  ifVj^EV  and  E-^  G  E. 

Spanning  subgraph  -  a  graph  =  (V-j^.E-J  of  G  —  {V,  E)ifV^  =  V  and  E^^E  E. 

Spanning  Tree  -  a  spanning  subgraph  of  a  graph  G  that  is  also  a  tree. 

Trail  -  a  walk  in  which  no  edge  is  used  more  than  once. 

Tree  -  a  graph  G  which  is  both  acyclic  and  connected. 

Undirected  Graph  -  a  graph  in  which  the  edges  have  no  orientation-  that  is,  edge  V2]  is 
identical  to  {V2,  v^] 

Walk  -  a  sequence  of  vertices  in  graph  G  beginning  with  u  and  ending  with  v  such  that 
consecutive  vertices  within  the  sequence  are  adjacent. 

Weighted  Graph  -  a  graph  G  where  there  is  a  real  number  expressing  some  form  of  interpoint 
cost  assigned  to  each  edge  in  G . 
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